Sensitivity, specificity, ROC
Thomas Alexander Gerds
Department of Biostatistics, University of Copenhagen

Epilog: disease prevalence
The prevalence is the proportion of cases in the population today. It is easy to estimate:
$Prev = \frac{\text{No. cases}}{\text{No. of subjects}}$
Exact confidence limits for proportions can be obtained from most statistical software packages. Approximate confidence limits are obtained from the formula
$\left[ Prev - 1.96 \sqrt{\frac{Prev(1 - Prev)}{\text{sample size}}} \; ; \; Prev + 1.96 \sqrt{\frac{Prev(1 - Prev)}{\text{sample size}}} \right]$

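A minimal R sketch of both kinds of confidence limits. The counts (30 cases among 200 subjects) are hypothetical and not taken from these slides; `binom.test` is the base R function that returns the exact (Clopper-Pearson) interval.

```r
## Hypothetical counts, for illustration only
n_cases    <- 30
n_subjects <- 200

prev <- n_cases / n_subjects

## Approximate 95% confidence limits from the formula above
se        <- sqrt(prev * (1 - prev) / n_subjects)
ci_approx <- c(lower = prev - 1.96 * se, upper = prev + 1.96 * se)

## Exact (Clopper-Pearson) 95% confidence limits
ci_exact <- binom.test(n_cases, n_subjects)$conf.int

prev
ci_approx
ci_exact
```

With small samples or a prevalence close to 0 or 1 the approximate limits can fall outside [0, 1], which is one reason to prefer the exact interval there.
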
Introduction
- A diagnosis is an estimate of the patient's current status.
- A prediction is an estimate of the patient's future status.
The estimates can be based on the patient's genotype, phenotype and exposure history.

Who is asking the question?
- The patient wants to know if she or he is diseased.
- The doctor wants to know what treatment to use.
- The researcher wants to know if a new marker is useful.
- The hospital wants to plan its resources.

Medical test
A medical diagnostic test is a decision rule
$X = \begin{cases} 1 & \text{positive / disease} \\ 0 & \text{negative / non-disease} \end{cases}$
The test can be based on a biomarker. A biomarker is any biological measurement made on a patient which is related to the disease status, extent, or activity.

Example: screening for prostate cancer
The first commercial PSA [1] test:
$X = \begin{cases} 1 & \text{positive if PSA} > 4.0 \text{ ng/ml} \\ 0 & \text{negative if PSA} \le 4.0 \text{ ng/ml} \end{cases}$
The reference range of serum PSA is 0.0-4.0 ng/ml (based on a study of 472 healthy men where 99% had a total PSA level below 4 ng/ml).
Some feel that this threshold should be lowered to 2.5 ng/ml in order to detect more cases of prostate cancer.
[1] Prostate Specific Antigen

From the Internet [2]
University of Michigan researchers identify new blood test for prostate cancer. The test looks at 22 biomarkers; the results are more accurate than PSA.
"These 22 biomarkers appear to be the right number. If you used too many or too few, the accuracy went down a bit. Our findings held up when we tested the model on an independent set of blood serum samples, ..."
[2] http://www.eurekalert.org/pub_releases/2005-09/uomh-uri091905.php

Predicted probabilities
A prediction rule is a fully specified mathematical function that maps patient characteristics to a predicted probability.
Patient characteristics may include:
- conventional predictors such as age, gender, blood pressure, etc.
- biomarkers and (high-dimensional) genetic markers
- exposure history (until today)
- treatment
The function is described by (estimated) parameters, such as regression coefficients and cut-off thresholds.
A set of biomarkers or a gene signature is not a prediction model.

Prostate Cancer Risk Calculator Available Online
American researchers have developed and released an online calculator to predict a man's risk of developing prostate cancer based on his age, biopsy results, PSA levels, and digital rectal exam results.
The original Prostate Cancer Prevention Trial (PCPT) Prostate Cancer Risk Calculator (PCPTRC), posted in 2006, was developed based upon 5519 men in the placebo group of the Prostate Cancer Prevention Trial. All of these 5519 men initially had a prostate-specific antigen (PSA) value less than or equal to 3.0 ng/ml, were followed for seven years with annual PSA and digital rectal examination (DRE), and had at least one prostate biopsy.

Prostate Cancer Risk Calculator Available Online
PSA, family history, DRE findings, and history of a prior negative prostate biopsy provided independent predictive value to the calculation of risk of a biopsy that showed presence of cancer.
Disclaimer
The calculator is in principle only applicable to men under the following restrictions:
- Age 55 or older
- No previous diagnosis of prostate cancer
- DRE and PSA results less than 1 year old

Prostate Cancer Risk Calculator in action [3]
[Figure: screenshots of the online risk calculator]
[3] http://www.prostate-cancer-risk-calculator.info/

What is behind the 'Prostate Cancer Risk Calculator'
The Prostate Cancer Prevention Trial [4]
"Here we used prostate biopsy data from 5519 participants in the PCPT to examine whether interactions among these variables (PSA level, family history of prostate cancer, age, race, and digital rectal examination) can be used to predict prostate cancer risk in an individual patient.
We used multivariable logistic regression to model the risk of prostate cancer by considering all possible combinations of main effects and interactions. The models chosen were those that minimized the Bayesian information criterion (BIC) and maximized the average out-of-sample area under the receiver operating characteristic curve (via 4-fold cross-validation)."
[4] Thompson et al. J Natl Cancer Inst, 98(8):529-34, 2006.

Notation for Binary Markers
Y: Outcome (disease status), e.g. coronary heart disease
X: Test result (biomarker), e.g. exercise stress test

          Y = 1            Y = 0
X = 0     False negative   True negative
X = 1     True positive    False positive

Evaluation of Binary Markers
To what extent does a biomarker reflect true disease status?
True positive rate: $TPR = P(X = 1 \mid Y = 1)$
False positive rate: $FPR = P(X = 1 \mid Y = 0)$
Terminology: TPR = sensitivity, FPR = 1 − specificity
Ideal tests have FPR = 0 and TPR = 1, but usually the two rates have to be optimized simultaneously.

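Estimating TPR and FPR
Use a case-control study if disease prevalence is low:
$FPR = \frac{\text{No. controls with positive test}}{\text{No. controls}}$
$TPR = \frac{\text{No. cases with positive test}}{\text{No. cases}}$
Confidence intervals are either obtained exactly or via:
$FPR \pm 1.96 \sqrt{(1/n_{control}) \, FPR \, (1 - FPR)}$
$TPR \pm 1.96 \sqrt{(1/n_{case}) \, TPR \, (1 - TPR)}$
Here FPR is a proportion among the $n_{control}$ controls and TPR a proportion among the $n_{case}$ cases.
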
Example: Coronary Artery Surgery Study [5]
Y: Coronary heart disease status
X: Exercise Stress Test

          Y = 0   Y = 1
X = 0     327     208
X = 1     115     815

FPR = 115/(115 + 327) = 0.26, TPR = 815/(208 + 815) = 0.80
[5] Data from Table 2 of: Weiner DA, Ryan TJ, McCabe CH, et al. Correlations among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). NEJM 301(5):230-235, 1979.

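The two rates in the CASS example can be reproduced in a few lines of R. This is a sketch using only the four counts from the table; the helper name `wald_ci` is made up for this illustration, and the intervals use the approximate 1.96 formula for a proportion.

```r
## Counts from the CASS table (rows: test result X, columns: disease status Y)
tab <- matrix(c(327, 115, 208, 815), nrow = 2,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))

n_control <- sum(tab[, "0"])   # subjects without disease
n_case    <- sum(tab[, "1"])   # subjects with disease

fpr <- tab["1", "0"] / n_control
tpr <- tab["1", "1"] / n_case

## Approximate 95% confidence limits for a proportion p estimated from n subjects
wald_ci <- function(p, n) p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n)

round(c(FPR = fpr, TPR = tpr), 2)
wald_ci(fpr, n_control)
wald_ci(tpr, n_case)
```
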
Continuous Markers
In many clinical applications, biomarkers are continuous (example: Prostate Specific Antigen (PSA) for prostate cancer).
For any given cut-off value c, we may define a test:

          Y = 0            Y = 1
X ≥ c     False positive   True positive
X < c     True negative    False negative

Classification accuracy:
$FPR(c) = P(X \ge c \mid Y = 0), \quad TPR(c) = P(X \ge c \mid Y = 1)$

Example: Pancreatic Cancer Study
The antigens CA 125 and CA 19-9 are possible biomarkers of pancreatic cancer [6].
[Figure: distribution of CA 125 among cases and controls]
Trade-off between:
- Increasing c lowers both FPR and TPR.
- Decreasing c raises both FPR and TPR.
[6] Wieand et al. (1989) studied $n_{case} = 90$ patients with pancreatic cancer and $n_{control} = 51$ control patients with pancreatitis.

Estimation of FPR(c) and TPR(c)
$FPR(c) = \frac{\text{No. controls with } X \ge c}{\text{No. controls}}$
$TPR(c) = \frac{\text{No. cases with } X \ge c}{\text{No. cases}}$

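A small R sketch of these two empirical proportions, evaluated at several cut-offs at once. The function name `roc_points` and the toy data are invented for illustration; a test counts as positive when the marker is at or above the cut-off.

```r
## Empirical FPR(c) and TPR(c) for a marker x, a status y (1 = case, 0 = control)
## and a vector of cut-off values; the test is positive when x >= cut-off.
roc_points <- function(x, y, cutoffs) {
  fpr <- sapply(cutoffs, function(thr) mean(x[y == 0] >= thr))
  tpr <- sapply(cutoffs, function(thr) mean(x[y == 1] >= thr))
  data.frame(cutoff = cutoffs, FPR = fpr, TPR = tpr)
}

## Toy data, invented for illustration
x <- c(1.2, 2.5, 3.1, 4.8, 0.9, 5.6, 4.1, 6.3)
y <- c(0,   0,   0,   0,   1,   1,   1,   1)

res <- roc_points(x, y, cutoffs = c(2, 4, 6))
res
```

Plotting `res$TPR` against `res$FPR` over all observed marker values as cut-offs gives the empirical ROC curve.
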
The Receiver Operating Characteristic (ROC) Curve
The ROC curve plots TPR(c) against FPR(c) for all different cut-off values c.
[Figure: ROC curves of a perfect test, an actual test, and a test with no predictive value; x-axis: false positive rate = 1 − specificity; y-axis: true positive rate = sensitivity]

Comparing markers
The ROC curve is invariant under monotone transformations of the marker (e.g. same ROC curve for X, log(X) and 4.13 X).
[Figure: ROC curves of CA 19-9 and CA 125]
- Useful for comparing different markers
- All markers are put on the same scale
- The ideal ROC curve approaches the point (0, 1).

Area under the curve (AUC)
[Figure: ROC curves of a perfect test, an actual test, and a test with no predictive value; x-axis: false positive rate = 1 − specificity; y-axis: true positive rate = sensitivity]
- The higher the better.
- Lower benchmark: 0.5 (coin toss)
- Upper benchmark: 1.0 (perfect discrimination)
- AUC is also known as the c-statistic.
- Methods for analyzing AUC are well established, e.g. Bamber (1975); DeLong, DeLong and Clarke-Pearson (1988).

Estimating AUC
AUC can be estimated based on (case, control) pairs:
$\widehat{AUC} = \frac{\text{No. (case, control)-pairs with } X_{case} > X_{control}}{\text{No. (case, control)-pairs}}$
AUC of CA 19-9 and CA 125:
- $\widehat{AUC} = 0.86$ for CA 19-9
- $\widehat{AUC} = 0.71$ for CA 125

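The pair-counting estimate is short to write in R. This is a sketch with invented toy data; the function name `auc_pairs` is made up, and counting tied pairs as one half is an assumption added here (a common convention), not something stated on the slide.

```r
## AUC as the proportion of (case, control) pairs in which the case has the
## larger marker value. Assumption: tied pairs count as one half.
auc_pairs <- function(x_case, x_control) {
  diffs <- outer(x_case, x_control, "-")   # one entry per (case, control) pair
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

## Toy data, invented for illustration
x_case    <- c(0.9, 5.6, 4.1, 6.3)
x_control <- c(1.2, 2.5, 3.1, 4.8)

auc_raw <- auc_pairs(x_case, x_control)
auc_log <- auc_pairs(log(x_case), log(x_control))
c(raw = auc_raw, log = auc_log)
```

Because only the ordering within each pair matters, the estimate is the same for the marker and for its logarithm, which illustrates the invariance of the ROC curve under monotone increasing transformations.
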
Limitations of AUC
Two ROC curves may cross, but this cannot be seen from the AUC.
[Figure: ROC curves; x-axis: FPR; y-axis: TPR]

Break

Limitation of ROC analysis
The classification accuracy measures are good:
- for describing the capacity a marker has in distinguishing a diseased subject from a non-diseased subject.
- in the discovery stage when interest lies in identifying markers for disease diagnosis and prognosis.
In clinical practice, patients and clinicians are often more interested in knowing:
- How likely is it that the patient is truly diseased if the test is positive?
- How likely is it that the patient is truly disease-free if the test is negative?

Predictive Values
The positive predictive value:
$PPV = P(Y = 1 \mid X = 1)$ = the probability that a patient with a positive test is diseased
The negative predictive value:
$NPV = P(Y = 0 \mid X = 0)$ = the probability that a patient with a negative test is not diseased
A perfect test has PPV = 1 and NPV = 1.
A useless test has PPV = Prev and NPV = 1 − Prev [7, 8].
[7] Similarly for a continuous marker X: $PPV(c) = P(Y = 1 \mid X \ge c)$, $NPV(c) = P(Y = 0 \mid X < c)$
[8] where Prev = disease prevalence

Estimation of predictive values
In a cohort study the PPV(c) and NPV(c) can be estimated directly as:
$PPV(c) = \frac{\text{No. cases among subjects with } X \ge c}{\text{No. subjects with } X \ge c}$
$NPV(c) = \frac{\text{No. controls among subjects with } X < c}{\text{No. subjects with } X < c}$
In a case-control study PPV and NPV cannot be estimated (by using this formula).

Relation to TPR and FPR
The clinical interpretations of PPV and NPV are different from those of FPR and TPR. But the values are closely related via the prevalence and Bayes' theorem:
$PPV(c) = \frac{TPR(c) \, Prev}{TPR(c) \, Prev + FPR(c)(1 - Prev)}$
$NPV(c) = \frac{\{1 - FPR(c)\}(1 - Prev)}{\{1 - FPR(c)\}(1 - Prev) + \{1 - TPR(c)\} \, Prev}$

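The following R sketch evaluates the two Bayes formulas. TPR and FPR are taken from the CASS example earlier in these slides; the three prevalence values are hypothetical and chosen only to show how strongly the predictive values depend on prevalence.

```r
ppv <- function(tpr, fpr, prev) {
  tpr * prev / (tpr * prev + fpr * (1 - prev))
}
npv <- function(tpr, fpr, prev) {
  (1 - fpr) * (1 - prev) / ((1 - fpr) * (1 - prev) + (1 - tpr) * prev)
}

## TPR and FPR from the CASS example; the prevalences are hypothetical
tpr  <- 815 / 1023
fpr  <- 115 / 442
prev <- c(0.01, 0.10, 0.50)

data.frame(prev = prev,
           PPV  = ppv(tpr, fpr, prev),
           NPV  = npv(tpr, fpr, prev))
```

As a check of the formulas, a test with TPR = FPR gives PPV = Prev and NPV = 1 − Prev, matching the "useless test" benchmark.
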
Example: epo study [9]
Anaemia is a deficiency of red blood cells and/or hemoglobin and an additional risk factor for cancer patients.
Randomized placebo-controlled trial: does treatment with epoetin beta (epo, 300 U/kg) enhance the hemoglobin concentration level and improve survival chances?
Henke et al. 2006 identified the c20 expression (erythropoietin receptor status) as a new biomarker for the prognosis of locoregional progression-free survival.
[9] Henke et al. Do erythropoietin receptors on cancer cells explain unexpected clinical findings? J Clin Oncol, 24(29):4708-4713, 2006.

Treatment
The study includes 149 [10] head and neck cancer patients with a tumor located in the oropharynx (36%), the oral cavity (27%), the larynx (14%) or the hypopharynx (23%).
One of the treatments was radiotherapy, given after complete, incomplete, or no resection:

Resection   Complete   Incomplete   No
Placebo     35         14           25
Epo         36         14           25

[10] with non-missing blood values

Outcome
Blood hemoglobin levels were measured weekly during radiotherapy (7 weeks).
Treatment with epoetin beta was defined as successful when the hemoglobin level increased sufficiently.
For patient i set
$Y_i = \begin{cases} 1 & \text{treatment successful} \\ 0 & \text{treatment failed} \end{cases}$

Target

Patient no.   Treatment successful   Predicted probability
1             0                      P_1
2             0                      P_2
3             1                      P_3
4             1                      P_4
5             0                      P_5
6             1                      P_6
7             1                      P_7

Predictors
- Age: min 41 y, median 59 y, max 80 y
- Gender: male 85%, female 15%
- Baseline hemoglobin: mean 12.03 g/dl, std 1.45
- Treatment: epo 50%, placebo 50%
- Stratum: complete 48%, incomplete 19%, no resection 34%
- Erythropoietin receptor status: neg 32%, pos 68%

Logistic regression
Response: treatment successful yes/no

Factor              OddsRatio   StandardError   CI.95             pvalue
(Intercept)         0.00        4.01                              < 0.0001
Age                 0.97        0.03            [0.91; 1.03]      0.2807
Sex:female          4.71        0.84            [0.91; 26.02]     0.0657
HbBase              3.25        0.27            [1.99; 5.91]      < 0.0001
Treatment:Epo       90.92       0.76            [23.9; 493.41]    < 0.0001
Resection:Incompl   1.75        0.81            [0.36; 9.03]      0.4924
Resection:Compl     4.14        0.69            [1.13; 17.36]     0.0395
Receptor:positive   5.81        0.66            [1.72; 23.39]     0.0076

The model provides general information
Treatment with epo significantly increases the chance (odds) of reaching the target hemoglobin level, by a factor of 90.92 (95% CI: [23.9; 493.4], p < 0.0001) in the overall study population.
Does that mean everyone should be treated?

The model provides information for a single patient
For example: the predicted probability that a 51-year-old man with complete tumor resection and baseline hemoglobin level 12.6 g/dl reaches the target hemoglobin level ($Y_i = 1$) is
- Epo group: 97.4%
- Placebo: 29.2%
If a similar patient has baseline hemoglobin level 14.8 g/dl then the model predicts:
- Epo group: 99.8%
- Placebo: 84.7%

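The link between these predictions and the odds ratio for treatment can be checked by hand. The R sketch below starts from the two placebo predictions, multiplies the odds by 90.92 and converts back to a probability. It assumes the fitted model contains main effects only (as in the regression table), so the same odds ratio applies to every patient; the helper names `odds` and `prob` are made up for this illustration, and the inputs are the rounded values shown on the slides.

```r
## Convert between probabilities and odds
odds <- function(p) p / (1 - p)
prob <- function(o) o / (1 + o)

or_epo <- 90.92                # odds ratio for Treatment:Epo from the regression table

## Placebo predictions for baseline hemoglobin 12.6 and 14.8 g/dl
p_placebo <- c(0.292, 0.847)

p_epo <- prob(odds(p_placebo) * or_epo)
round(100 * p_epo, 1)          # 97.4 99.8
```

The same odds ratio moves a patient from 29.2% to 97.4% but only from 84.7% to 99.8%: a constant effect on the odds scale translates into very different gains on the probability scale.
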
Predicted treatment success probability
For a treated man with no resection possible and negative epo receptor status.
[Figure: predicted risk (0% to 100%) as a function of age (years, x-axis) and baseline hemoglobin (g/dl, y-axis)]

The model behind the table
$\log\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i}$
$P_i = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})\}}$
- $P_i$: the probability of successful treatment
- $x_{1,i}$: first predictor for subject i (e.g. age = 50)
- $x_{2,i}$: second predictor for subject i (e.g. gender = male)
- $x_{k,i}$: k'th predictor for subject i (e.g. eporeceptor = pos)
$\beta_0, \dots, \beta_k$ are regression coefficients that are estimated based on the epo study.

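A short R sketch of how the two formulas fit together: the linear predictor is turned into a probability by the inverse logit, and the logit turns it back. The coefficients and predictor values below are hypothetical, not estimates from the epo study; `plogis` and `qlogis` are the base R inverse logit and logit functions.

```r
## Hypothetical coefficients and predictor values, for illustration only
beta <- c(intercept = -2.0, x1 = 0.5, x2 = 1.2)
x    <- c(1, 3.0, 1)             # the leading 1 multiplies the intercept

lp <- sum(beta * x)              # linear predictor beta_0 + beta_1 x_1 + beta_2 x_2
p  <- 1 / (1 + exp(-lp))         # predicted probability

## The built-in functions give the same results
all.equal(p, plogis(lp))         # inverse logit
all.equal(qlogis(p), lp)         # logit, i.e. log(p / (1 - p))
```
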
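Logistic regression in R

    data(epo)
    ## binary response: 1 = treatment successful, 0 = treatment failed
    epo$Y <- as.numeric(epo$hbsuccess == 1)
    ## fit the model via glm
    ## (R is case-sensitive: object and variable names must match the data set exactly)
    glmfit <- glm(Y ~ age + sex + hbbase + arm + resection + eporec,
                  data = epo, family = "binomial")
    ## predicting the chance of successful treatment for all patients in the data
    predict(glmfit, type = "response", newdata = epo)
    ## ... and for a single new patient
    predict(glmfit, type = "response",
            newdata = data.frame(hbbase = 13.4, age = 54, sex = "male",
                                 arm = "epo", resection = "Incomplete",
                                 eporec = "positive"))
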
Evaluation of prediction models
Idea: use the predicted probability P for a positive outcome as a continuous "marker":

          P ≥ c            P < c
Y = 0     False positive   True negative
Y = 1     True positive    False negative

Assessing the added value of a marker
[Figure: ROC curves (sensitivity against 1 − specificity) of two logistic regression models (LRM)]
- LRM (full): AUC = 0.947
- LRM (without eporec): AUC = 0.933

Assessing the added value of a marker
[Figure: ROC curves (sensitivity against 1 − specificity) of two logistic regression models (LRM)]
- LRM (full): AUC = 0.947
- LRM (without treatment arm): AUC = 0.796

Assessing the predictive power of the model
A residual is defined as the difference between:
- the true disease status of a patient
- the probability of disease for this patient as predicted by the model
Residual(Patient i) = Status(i) − Predicted probability(i) = $Y_i - P_i$

Assessing the predictive power of the model

Patient no.   Treatment successful   Predicted probability (%)   Residual
              Y_i                    P_i                         Y_i − P_i
1             0                      2.31                        -2.31
2             0                      1.91                        -1.91
3             1                      98.11                       1.89
4             1                      79.58                       20.42
5             0                      4.21                        -4.21
6             1                      98.81                       1.19
7             1                      64.72                       35.28

Brier score
The Brier score is the squared difference between a patient's status and the predicted probability for this patient.
The mean Brier score is a useful summary of the accuracy of the model:
$B = \frac{1}{N} \sum_{i=1}^{N} (Y_i - P_i)^2 = \frac{1}{N} \{(Y_1 - P_1)^2 + \dots + (Y_N - P_N)^2\}$
where N is the sample size.
The ideal model has mean Brier score equal to 0 (perfect prediction).

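As a worked example in R, the residuals and Brier scores for patients 1 to 7 from the residual table in these slides can be computed directly. Note that the predicted probabilities enter as proportions (0 to 1), not as percentages; the mean below covers only these seven patients, not the whole study.

```r
## Status and predicted probabilities (%) for patients 1-7 from the residual table
y <- c(0, 0, 1, 1, 0, 1, 1)
p <- c(2.31, 1.91, 98.11, 79.58, 4.21, 98.81, 64.72) / 100   # as proportions

residual <- y - p
brier_i  <- residual^2      # Brier score of each patient
brier_i
mean(brier_i)               # mean Brier score over these seven patients
```
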
Assessing the predictive power of the model

Patient no.   Treatment successful   Predicted probability (%)   Residual    Brier score
              Y_i                    P_i                         Y_i − P_i   (Y_i − P_i)^2
142           0                      84.09                       -84.09      0.7071
143           0                      93.47                       -93.47      0.8737
144           0                      18.73                       -18.73      0.0351
145           0                      1.81                        -1.81       0.0003
146           0                      3.86                        -3.86       0.0015
147           1                      96.64                       3.36        0.0011
148           0                      0.5                         -0.5        < 0.0001
149           0                      11.93                       -11.93      0.0142
Σ (mean Brier score over all patients)                                       0.0869

The Brier scores are computed with P_i as a proportion (e.g. 0.8409), not as a percentage.

Comparison to a model that ignores all covariates

Patient no.   Treatment successful   Predicted probability (%)   Residual    Brier score
              Y_i                    P_i                         Y_i − P_i   (Y_i − P_i)^2
142           0                      44.3                        -44.3       0.1962
143           0                      44.3                        -44.3       0.1962
144           0                      44.3                        -44.3       0.1962
145           0                      44.3                        -44.3       0.1962
146           0                      44.3                        -44.3       0.1962
147           1                      44.3                        55.7        0.3103
148           0                      44.3                        -44.3       0.1962
149           0                      44.3                        -44.3       0.1962
Σ (mean Brier score over all patients)                                       0.247

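The benchmark value 0.247 can be verified with a few lines of R. The sketch assumes that the constant prediction 44.3% equals the observed proportion of successful treatments, which is what a model without covariates would predict; under that assumption the mean Brier score reduces to p(1 − p).

```r
p0 <- 0.443                       # constant prediction for every patient

brier_success <- (1 - p0)^2       # Brier score of a patient with Y = 1
brier_failure <- (0 - p0)^2       # Brier score of a patient with Y = 0

## Assumption: a proportion p0 of the patients has Y = 1
mean_brier_null <- p0 * brier_success + (1 - p0) * brier_failure

round(c(brier_success, brier_failure, mean_brier_null), 4)
all.equal(mean_brier_null, p0 * (1 - p0))
```

The per-patient values differ from the table in the last digit at most, because 44.3% is itself rounded. A model is only useful if its mean Brier score is clearly below this benchmark.
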
Improving the predictive power
- New predictor variables
- Variable selection
- Different link function
- Systematically searching for the model that optimizes the predictive power
- Machine learning tools

Take home messages
- Sensitivity, specificity and the predictive values are important for medical practice.
- The ROC curve is useful for summarizing the discriminatory capacity of a marker or a model.
- Summary measures like AUC and Brier score are useful for comparing the overall accuracy of diagnostic and prognostic models.