It's hard to predict!

Statistical Methods for Prediction
Steven Goodman, MD, PhD
With thanks to: Ciprian M. Crainiceanu, Associate Professor, Department of Biostatistics, JHSPH

It's hard to predict!
- People with no future: Marilyn Monroe and Elvis Presley.
- Useless or impossible technologies: telephones, light bulbs, radio, TV, rockets, atomic bombs, X-rays, space flight, portable computers.
- Barack Obama's NCAA predictions.

Risk Prediction Models
What do we want risk prediction models to do?
- Randolph et al. (1998, Crit Care Med): Prediction tools only inform decisions if they help us differentiate between those patients with higher risks of encountering the outcome from those patients with lower risks.
- Moons and Harrell (2005, Academic Radiology): Ultimately the patient and his physician want to know his risk of disease.
- Cook (2007, Circulation): Most important for clinical risk prediction is whether a model can more accurately stratify individuals into higher or lower risk categories of clinical importance.
These descriptions involve discriminatory accuracy, calibration, and predictive accuracy.

Goal of these models
- NOT just to distinguish between diseased and non-diseased patients, or those who will and won't prevent disease.
- To improve the overall health outcomes (or reduce cost or suffering) in the group to which the model is applied.
- To do more good than harm.

Differences from explanatory or etiologic epidemiology
- Predictive model/marker studies: Subject -> Test setting -> Test -> Result -> Action -> Consequences
- Prediction is also concerned with individual classification, NOT statistical distinction of group averages.

Steps of Statistical Prediction Modeling
- Design
- Calibration, Discrimination
- Validation
- Assessment
- Incremental value
- Reclassification

Many ways of constructing prediction models
- Statistical modeling
- Regression, regression trees
- Random forests
- Classification trees
- Clustering
- Data mining
- Neural networks
- Machine learning
- Support vector machines
It (almost) doesn't matter how you build your model if you validate it properly!

Examples of prediction
- Framingham risk score / CHD: sex, age, total cholesterol, HDL, BP, diabetes, smoking
- Gail risk model / breast cancer: personal history, age, age at menarche, age at first live birth, family history, number of biopsies, race
- Polls: election, market preferences, etc.
- Markets: pay per prediction

Regression line, confidence interval, prediction interval
[Figures: scatterplots showing the regression line, then the regression line with its confidence interval (CI), then with the prediction interval (PI) added. L(CI) and L(PI) are the lengths of the two intervals.]
- Fit 1: β0 = 1.35, β1 = 1.66, se(β1) = 0.75, p-value = 0.06, r = 0.62, R² = 0.38 = 0.62², L(CI) = 3.62, L(PI) = 9.03
- Fit 2: β0 = 0.71, β1 = 1.87, se(β1) = 0.24, p-value < [missing], r = 0.49, R² = 0.24, L(CI) = 0.75, L(PI) = 7.72
- Fit 3: β0 = 0.88, β1 = 1.53, se(β1) = 0.08, p-value < [missing], r = 0.42, R² = 0.18, L(CI) = 0.23, L(PI) = [missing]

Length of confidence and prediction intervals: lessons
- Length of CI = uncertainty about the population average response, which is not the uncertainty about the predicted value
- Length of PI = uncertainty about an individual response (outcome)
- Statistical significance is not the same as prediction relevance
- Length of CI << Length of PI

Statistical significance and prediction
- They are different
- Statistical significance is a weak surrogate for goodness of prediction
- Statistical significance can be used to screen for potential confounders
- Covariates that are not significant probably do not have much predictive power

Prediction in binary regression
- Outcome is 0-1
- Examples: non-diseased / diseased; alive / dead; failure / success (procedure)
- Type of binary regression: logistic (log odds of success/failure)
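The contrast between the two interval lengths can be checked numerically. The following is a minimal sketch on simulated data, not the data behind the slides: the data-generating values, the evaluation point `x0`, and the use of a normal quantile in place of the t quantile are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def interval_lengths(x, y, x0, level=0.95):
    """Return (CI length for the mean response, PI length for a new outcome) at x0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    # Residual variance with n - 2 degrees of freedom
    s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Assumption: normal quantile as a large-sample stand-in for the t quantile
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se_mean = math.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / sxx))
    se_new = math.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / sxx))
    return 2 * z * se_mean, 2 * z * se_new

def simulate(n, seed=0):
    """Hypothetical data: y = 1 + 1.5 x + noise with standard deviation 2."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 2) for _ in range(n)]
    y = [1.0 + 1.5 * xi + rng.gauss(0, 2) for xi in x]
    return x, y

for n in (10, 100, 1000):
    x, y = simulate(n)
    ci, pi = interval_lengths(x, y, x0=1.0)
    print(f"n={n}: L(CI)={ci:.2f}, L(PI)={pi:.2f}")
```

As the sample grows, the CI length shrinks toward zero while the PI length settles near 2 x 1.96 x (residual SD), which is why a highly significant slope can still predict individual outcomes poorly.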

Sensitivity and specificity of binary prediction rules
- Sensitivity: P(prediction = 1 | outcome = 1), estimated as Sens. = TP / (TP + FN)
- Specificity: P(prediction = 0 | outcome = 0), estimated as Spec. = TN / (TN + FP)
- Estimators depend on the threshold
[Figure: sensitivity and specificity curves (red = specificity)]
[Figure: Receiver Operating Characteristic (ROC) curve]
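To make the threshold dependence concrete, here is a small sketch that applies the two estimators at several cutoffs. The risk scores and outcomes are made-up values, and the convention "predict 1 when risk >= threshold" is an assumption.

```python
def sens_spec(risks, outcomes, threshold):
    """Sensitivity and specificity of the rule 'predict 1 if risk >= threshold'."""
    tp = fn = tn = fp = 0
    for risk, outcome in zip(risks, outcomes):
        predicted = 1 if risk >= threshold else 0
        if outcome == 1 and predicted == 1:
            tp += 1
        elif outcome == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predicted risks and observed 0/1 outcomes
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]

for t in (0.25, 0.5, 0.75):
    sens, spec = sens_spec(risks, outcomes, t)
    print(f"threshold={t}: sensitivity={sens}, specificity={spec}")
```

Raising the threshold trades sensitivity for specificity; plotting the resulting (1 - specificity, sensitivity) pairs over all thresholds gives the ROC curve.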

Area under the ROC curve (AUC)
- The probability that, given two subjects, one who will develop an event and one who will not, the model assigns a higher probability of an event to the former
- One of the main criteria for assessing discrimination accuracy
- AUC = 0.68 (in the example)

Steps of Statistical Prediction Modeling
- Design
- Calibration
- Validation
- Replication
- Extrapolation
- Refinement and Adaptation

Calibration, aka clinical validity
- Calibration: how well the observed outcomes agree with the predicted outcomes
- Most model-fitting strategies select the model that is best calibrated to the data
- In statistics, calibration is called model fit
- In clinical prediction, the ability to predict prognosis is called clinical validity. The ability to predict response to therapy is called either clinical utility or predictive ability (as opposed to "prognostic")
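The pairwise-probability reading of the AUC can be computed directly by comparing every case with every control. This is a sketch on made-up scores; counting ties as one half is a common convention and is an assumption here.

```python
def auc(risks, outcomes):
    """Proportion of (case, control) pairs in which the case has the higher risk."""
    cases = [r for r, d in zip(risks, outcomes) if d == 1]
    controls = [r for r, d in zip(risks, outcomes) if d == 0]
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5  # assumption: ties count as half
    return concordant / (len(cases) * len(controls))

# Hypothetical predicted risks and observed 0/1 outcomes
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
print(auc(risks, outcomes))
```

Only the ordering of the risks matters in this calculation, so the AUC says nothing about whether the predicted probabilities themselves are accurate.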

Measures of Calibration
- Goodness-of-fit statistics, e.g., the Hosmer-Lemeshow statistic, comparing observed and expected outcomes within quantiles
- Ad hoc comparisons of observed and expected outcomes

Example: SUPPORT Study
- Develop a model to predict risk of death for seriously ill hospitalized adults, to assist physicians in clinical decision making
- Cox regression model fit using a prospective study of 4301 hospitalized adults with at least 1 of 9 illnesses and an expected 6-month mortality of 50%
- Predictors: disease category, severity of acute disease as measured by physiologic abnormalities, evaluation of the patient's long-term health status, age, comorbid conditions, number of days hospitalized before study entry (collected 3 days after study entry)
- Validation in 4028 independent patients

Calibration of SUPPORT Model
[Figure] From Knaus et al., The SUPPORT Prognostic Model: Objective Estimates of Survival for Seriously Ill Hospitalized Adults. Ann Int Med 122:

Example: Calibration of Model Predicting Survival in Cirrhotic Patients
[Figure] Shows agreement between observed (step function) and predicted (smooth function) survival, based on a Cox regression model.
From Guardiola et al., External validation of a prognostic model for predicting survival of cirrhotic patients with refractory ascites. Am J Gastroenterol 97:
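A Hosmer-Lemeshow-style comparison for a binary outcome can be sketched as follows: sort subjects by predicted risk, split them into quantile groups, and compare observed with expected event counts in each group. The data are simulated, the function name is hypothetical, and the sketch assumes no group has a mean predicted risk of exactly 0 or 1.

```python
import random

def hosmer_lemeshow(risks, outcomes, groups=10):
    """Return per-group (group, size, observed, expected) rows and the HL statistic."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    n = len(order)
    rows, stat = [], 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        observed = sum(outcomes[i] for i in idx)
        expected = sum(risks[i] for i in idx)
        mean_risk = expected / size
        # Squared difference scaled by the binomial variance of the group
        stat += (observed - expected) ** 2 / (size * mean_risk * (1 - mean_risk))
        rows.append((g + 1, size, observed, round(expected, 1)))
    return rows, stat

# Hypothetical data: outcomes are drawn from the true risks
rng = random.Random(1)
true_risk = [rng.uniform(0.05, 0.95) for _ in range(2000)]
outcomes = [1 if rng.random() < p else 0 for p in true_risk]

rows, stat = hosmer_lemeshow(true_risk, outcomes)
for row in rows:
    print(row)
print("HL statistic:", round(stat, 1))
```

The statistic is usually referred to a chi-square distribution; the printed table of observed versus expected counts is often more informative than the single number.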

Discrimination
- How well does the risk prediction model separate the two groups?
- Want separated risk distributions for the two groups

Discrimination vs. Calibration
- Suppose the risk of death in 5 years is 50%:
  - a model that assigns a risk of 50% to the entire population is perfectly calibrated, but has no discrimination
  - a model that assigns all cases 11% risk and all controls 10% risk discriminates perfectly but is poorly calibrated
- Models typically cannot be perfect in both
- We always want a model that is well calibrated, but discrimination (and predictive accuracy) relate more directly to the clinical utility of the model

Measures of Discrimination
- TPR/FPR, ROC curve
  - Time-dependent versions for time-varying outcomes
- C-index (concordance statistic)
  - For binary outcomes, C-index = area under the ROC curve (AUC) = probability that a case has a higher risk score than a control
  - For time-varying binary outcomes, C-index = probability that the person with the earlier event has the higher risk score
  - Very popular, but little clinical relevance
- Misclassification rate
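The 50% / 11% / 10% example can be reproduced in a few lines. This sketch builds a hypothetical population of 50 cases and 50 controls (the population size is an assumption) and compares the two models on average predicted risk (calibration in the large) and on AUC (discrimination).

```python
def auc(risks, outcomes):
    """Probability that a case has a higher risk than a control (ties count half)."""
    cases = [r for r, d in zip(risks, outcomes) if d == 1]
    controls = [r for r, d in zip(risks, outcomes) if d == 0]
    total = 0.0
    for c in cases:
        for k in controls:
            total += 1.0 if c > k else 0.5 if c == k else 0.0
    return total / (len(cases) * len(controls))

def mean(values):
    return sum(values) / len(values)

# Hypothetical population with a true 5-year risk of death of 50%
outcomes = [1] * 50 + [0] * 50

model_a = [0.50] * 100                       # same risk for everyone
model_b = [0.11] * 50 + [0.10] * 50          # 11% for cases, 10% for controls

observed_rate = mean(outcomes)
for name, risks in (("A", model_a), ("B", model_b)):
    print(f"model {name}: mean predicted={mean(risks):.3f}, "
          f"observed={observed_rate:.3f}, AUC={auc(risks, outcomes):.2f}")
```

Model A matches the observed rate but has an AUC of 0.5; model B has an AUC of 1.0 but predicts an average risk near 10.5% where 50% is observed.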

Example: Gail Model for Breast Cancer Risk
- Discriminatory accuracy assessed by Rockhill et al. (2001, JNCI)
- 82,109 white women from the Nurses' Health Study [age range missing]
- Focus on invasive breast cancer within the 5-year period
- From Rockhill et al., Validation of the Gail et al. model of breast cancer risk prediction and implications for chemoprevention. JNCI 93:

ROCs for constant ORs and logistic regression
[Figure] From Pepe et al., Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker. AJE, 2004.
FIGURE 2. Probability distributions of a marker, X, in cases (solid curves) and controls (dashed curves) consistent with the logistic model logit P(D = 1 | X) = α + βX. It has been assumed that X has a mean of 0 and a standard deviation of 0.5 in controls so that a unit increase represents the difference between the 84th and 16th percentiles of X in controls. The marker is normally distributed, with the same variance in cases. The odds ratio (OR) per unit increase in X is shown.
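Under the assumptions stated in the figure caption (normal marker, equal variance in cases and controls, standard deviation 0.5), the odds ratio per unit of X determines the AUC. The sketch below is a derivation under those assumptions, not a reproduction of the published figure: with equal-variance normals the logistic slope is β = Δ/σ², where Δ is the case-control mean difference, and AUC = Φ(Δ / (σ√2)). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def auc_from_odds_ratio(odds_ratio, sd=0.5):
    """AUC implied by an odds ratio per unit of X under the equal-variance normal model."""
    beta = math.log(odds_ratio)   # logistic slope per unit increase in X
    delta = beta * sd ** 2        # implied difference in means, cases minus controls
    return NormalDist().cdf(delta / (sd * math.sqrt(2)))

for odds_ratio in (1.5, 2, 3, 9, 16):
    print(f"OR={odds_ratio}: AUC={auc_from_odds_ratio(odds_ratio):.3f}")
```

Even an odds ratio of 3 per unit corresponds to an AUC of only about 0.65 in this setup, which illustrates why a strong association does not by itself give good classification.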

VALIDATION
Avoiding Overoptimism When Calculating Model Accuracy

Internal vs. External Validation
- Internal validation: evaluate the model by carefully using the current data
  - To evaluate the accuracy of the model in the current setting
  - Our focus today
- External validation: use a different dataset
  - E.g., a different geographic location, different study population, different time period
  - To determine generalizability

What are you validating?
- The whole model-fitting process, including the mathematical form of the model?
- The predictors in the model?
- The coefficients of the predictors in the model?
Whatever you are validating, you need to freeze that in the validation/test set. But be very careful about what you are not validating.

The Simplest Approach: Training/Test Split
- Randomly divide the data into two parts
- Use one part to fit the model, the other to validate it
- E.g., 2/3 training; 1/3 test
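A training/test split with a frozen model can be sketched as below. The "model" is deliberately trivial (a single cutoff on one simulated marker, chosen to maximize training accuracy); the data, function names, and accuracy measure are all assumptions for illustration.

```python
import random

def train_test_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly divide rows into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(rows, cutoff):
    """Share of rows classified correctly by 'predict 1 if marker >= cutoff'."""
    return sum((x >= cutoff) == bool(d) for x, d in rows) / len(rows)

def fit_cutoff(rows):
    """'Fit' the model: pick the observed marker value with the best accuracy."""
    return max((x for x, _ in rows), key=lambda c: accuracy(rows, c))

# Hypothetical data: (marker, outcome) pairs, marker shifted upward in cases
rng = random.Random(42)
rows = []
for _ in range(300):
    d = 1 if rng.random() < 0.3 else 0
    rows.append((rng.gauss(0.8 * d, 1.0), d))

train, test = train_test_split(rows)
cutoff = fit_cutoff(train)          # frozen from here on
print("training accuracy:", round(accuracy(train, cutoff), 3))
print("test accuracy:    ", round(accuracy(test, cutoff), 3))
```

The cutoff is chosen once on the training part and never revisited; the test accuracy is the honest estimate, and it is typically somewhat lower than the training accuracy.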

Training/Test Split, cont'd
- Useful when:
  - Fitting the model is very computationally intensive or involves decisions that cannot be automated
  - The dataset is large
- Downside: inefficient use of data (we'd like to use all the data to fit the model)

K-Fold Cross-Validation
- Randomly split the data into K parts, e.g., 6 parts: Train | Train | Test | Train | Train | Train
- For the kth part (3rd part above), fit the model to the K-1 other parts, and use the model to predict outcomes on the kth part
- Do this for k = 1, 2, ..., K, each time obtaining a new model and filling in predicted values for part of the data
- Calculate model accuracy using these predicted values
- Note that each observation has one predicted value, obtained from a model that was fit without that observation

Summary
- Using the same data to fit and validate the risk prediction model leads to an overly optimistic estimate of model accuracy
- Most common approach to correcting for overoptimism: training/test split
- More sophisticated approaches: cross-validation and bootstrapping
  - Use the data more efficiently
  - Key issue is determining which model to use in practice
- A prediction rule does not merit the name without proper validation.

Aspects of Model Performance
- Calibration: how well does the model fit the data (bias)
- Discrimination: how well does the model discriminate between the two groups of individuals (ordering)
- Incremental value: how much does it improve on what we already knew?
- Clinical utility: how does use of the model impact clinical outcomes?
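The K-fold procedure described above can be sketched with generic `fit` and `predict` callables. The toy model (event rate within each level of a binary covariate) and the data are hypothetical and only serve to show the bookkeeping.

```python
import random

def k_fold_predictions(rows, fit, predict, k=6, seed=0):
    """Return one out-of-fold prediction per row."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    folds = [idx[j::k] for j in range(k)]      # K roughly equal random parts
    predictions = [None] * len(rows)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [rows[i] for i in idx if i not in held_out]
        model = fit(train)                     # fit without the held-out part
        for i in test_idx:
            predictions[i] = predict(model, rows[i])
    return predictions

def fit(train):
    """Hypothetical model: event rate within each level of a binary covariate."""
    model = {}
    for level in (0, 1):
        ys = [d for x, d in train if x == level]
        model[level] = sum(ys) / len(ys)
    return model

def predict(model, row):
    return model[row[0]]

# Hypothetical data: (binary covariate, 0/1 outcome)
rows = [(i % 2, 1 if (i % 3 == 0 or (i % 2 == 1 and i % 5 == 0)) else 0)
        for i in range(60)]

predictions = k_fold_predictions(rows, fit, predict, k=6)
brier = sum((p - d) ** 2 for p, (_, d) in zip(predictions, rows)) / len(rows)
print("cross-validated Brier score:", round(brier, 3))
```

Accuracy is computed once, over the pooled out-of-fold predictions, so every observation contributes a prediction from a model that never saw it.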

Incremental Value
- The increase in classification accuracy when additional information is added to the model
- NOT:
  - The magnitude of the new coefficient
  - The statistical significance of the new coefficient
- Easy to find examples of factors that are strongly associated, but do not impact classification accuracy

Example: Predicting Pancreatic Cancer Risk
Incremental Value of CA-125
[Figure] Solid line: ROC curve for CA-19-9. Dashed line: ROC curve for risk model.
From Pepe et al., Limitations of the odds ratio for gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 159:

Incremental Value of CA-125, cont'd
- ROC curves show improvement in classification accuracy with CA-125
- Change in risk distribution not shown in this figure (see next 2 examples)

Example: Utility of CRP in Predicting Cardiovascular Risk
- Cook et al. (2006, Ann Int Med)
- Develop cardiovascular risk prediction model, with and without CRP
- Non-diabetic women in the Women's Health Study, a nationwide cohort aged 45 years and older, free of cancer and CVD at entry
- Followed annually for development of CVD (average 10 years)

CRP Example, cont'd
- Cox regression model with covariates measured at baseline: age, CRP, HDL, total cholesterol, SBP, antihypertensive use, current smoking

The Net Benefit of CRP
[Table: cross-classification of predicted risk under the Model Without CRP (rows) and the Model With CRP (columns), with categories < 5%, 5% to 10%, 10% to 20%, > 20%, and Total. Cell counts are not recoverable.]

Discriminatory Accuracy
If risk > 20% suggests intervention:
- Without CRP: 55/725 (7.6%) of cases identified; 154/26202 (0.6%) of controls falsely identified
- With CRP: 57/725 (7.9%) of cases identified; 162/26202 (0.6%) of controls falsely identified

If risk > 10% suggests intervention:
- Without CRP: 146/725 (20.1%) of cases identified; 866/26202 (3.3%) of controls falsely identified
- With CRP: 171/725 (23.6%) of cases identified; 944/26202 (3.5%) of controls falsely identified

Summary: Evaluating Incremental Value
- The improvement in classification when additional information is added to the model
- NOT:
  - The magnitude of the new coefficient
  - The statistical significance of the new coefficient
  - The improvement in health status with the new model

How Much Do SNPs Improve Models to Predict Breast Cancer Risk?
Gail model slides provided by Mitchell H. Gail, Biostatistics Branch, Division of Cancer Epidemiology and Genetics
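The case and control counts quoted above translate directly into true-positive and false-positive rates. The sketch below recomputes them from the counts given on the slide; the dictionary layout and names are assumptions, and recomputed percentages may differ from the slide's rounded figures in the last digit.

```python
CASES, CONTROLS = 725, 26202

# (cases flagged, controls flagged) at each risk threshold, as quoted on the slide
flagged = {
    (">20%", "without CRP"): (55, 154),
    (">20%", "with CRP"): (57, 162),
    (">10%", "without CRP"): (146, 866),
    (">10%", "with CRP"): (171, 944),
}

def rates(cases_flagged, controls_flagged):
    """True-positive rate among cases and false-positive rate among controls."""
    return cases_flagged / CASES, controls_flagged / CONTROLS

for threshold in (">20%", ">10%"):
    tpr0, fpr0 = rates(*flagged[(threshold, "without CRP")])
    tpr1, fpr1 = rates(*flagged[(threshold, "with CRP")])
    print(f"risk {threshold}: TPR {tpr0:.1%} -> {tpr1:.1%}, "
          f"FPR {fpr0:.1%} -> {fpr1:.1%}")
```

Read this way, adding CRP at the 10% threshold identifies 25 more of the 725 cases at the price of flagging 78 more of the 26,202 controls.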

Breast Cancer Risk Assessment Tool (BCRAT)
- The NCI's BCRAT, or Gail Model 2
- Risk factors in BCRAT:
  - Age
  - Age at first live birth
  - Age at menarche
  - Number of mother/sisters with breast cancer
  - Number of previous benign breast biopsies and whether atypical hyperplasia was present on any
- Well calibrated
- Discriminatory accuracy modest

BRCA 1 and 2
- BRCA1 and BRCA2 (breast cancer 1 and 2) are human tumor suppressor genes
- 13.2% (132 out of 1,000) of women in the general population will develop breast cancer
- 36% to 85% (360 to 850 out of 1,000) of women with an altered BRCA1 or BRCA2 gene develop breast cancer
- Gene mutations are rare in the population ([value missing]%); in cases: 1-2%
- Thus, one needs to look for genes that are predictive of breast cancer but have higher allele frequencies

SNPs Associated with Breast Cancer
[Table: columns are Location, Disease Allele Frequency, Odds Ratio per Allele, Reference. Rows as extracted: FGFR, TNRC9 (or TOX3), MAP3K, LSP, CASP, q, q, and a geometric mean row. Numeric entries are not recoverable.]
References: Easton et al., Nature 2007;447: ; Cox et al., Nature Genetics 2007;39: ; Stacey et al., Nature Genetics 2007;39:

ROC-type Plots
[Figure: Prob(r > t) in cases plotted against Prob(r > t) in the general population, for BCRAT + 7 and BCRAT.]

Conclusions
- Very modest public health improvements from BCRATplus7 for:
  - Discriminatory accuracy (AUC) (4.1%)
  - Deciding whether to take tamoxifen (0.1% or 0.8%)
  - Deciding to have mammogram (0.8% or 0.1%)
  - Allocating scarce mammogram resources (5.5%)
- Reclassification versus BCRAT useful for individuals if BCRATplus7 is well calibrated
- BCRATplus7 needs to be validated in independent cohort data on individuals

Conclusions (continued)
- Usefulness of SNPs depends on the application, validity of model, and costs
- To achieve high discriminatory accuracy (AUC = 0.8) would require hundreds of SNPs, optimistically.

Evaluating the added predictive ability of a new marker: Reclassification
Slides based on the paper: Pencina et al., Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine.

Net reclassification improvement (NRI)
- Same setting as for reclassification tables
- Classify predicted risks using two prediction models
- Calculate the net reclassification improvement (NRI)
- Ideas:
  - Consider subjects who develop and do not develop events separately
  - For event subjects:
    - Any upward movement in categories improves classification
    - Any downward movement in categories worsens classification
  - For non-event subjects:
    - Any upward movement in categories worsens classification
    - Any downward movement in categories improves classification

NRI definition
- NRI = [P(up | D=1) - P(down | D=1)] - [P(up | D=0) - P(down | D=0)], where D is the disease event
- Estimators: each probability is estimated by the observed proportion of subjects making that move

Example
- Events (n = 183): upward moves 15 + 14 = 29; downward moves 4 + 3 = 7
- Contribution to NRI from events = (29 - 7) / 183 = 0.12
- Non-events (n = 3081): upward moves = 173; downward moves = 174
- Contribution to NRI from non-events = (174 - 173) / 3081 = 0.0003
- NRI = (29 - 7) / 183 - (173 - 174) / 3081 = 0.12 + 0.0003 ≈ 0.12

Issues related to NRI
- NRI depends heavily on reclassification in events
- NRI tends to depend less on reclassification in non-events
- Same importance for up and down classification
- Same importance for events and non-events
- Same importance for 1- or 2-category jumps
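The NRI arithmetic can be checked with a short function. The event counts (29 up, 7 down, 183 events) and the non-event total (3081) come from the slide; reading the garbled non-event counts as 173 upward and 174 downward moves is an assumption, as is the function name.

```python
def nri(up_events, down_events, n_events,
        up_nonevents, down_nonevents, n_nonevents):
    """Return (event contribution, non-event contribution, NRI)."""
    # Events: moving up is an improvement
    event_part = (up_events - down_events) / n_events
    # Non-events: moving down is an improvement
    nonevent_part = (down_nonevents - up_nonevents) / n_nonevents
    return event_part, nonevent_part, event_part + nonevent_part

event_part, nonevent_part, total = nri(29, 7, 183, 173, 174, 3081)
print(f"events: {event_part:.4f}, non-events: {nonevent_part:.4f}, NRI: {total:.4f}")
```

Almost all of the NRI here comes from the 183 events; the 3081 non-events contribute close to nothing, which is the imbalance noted under the issues listed above.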

Summary of prediction modeling
- Predicting the outcomes of individuals is hard.
- It is hard for genetic predictors or biomarkers to add materially to predictions based on clinical factors.
- It is even harder for prediction models to make a difference in patient outcomes.
- Validate, validate, validate!
