Cross-validated AUC in Stata: CVAUROC


Cross-validated AUC in Stata: CVAUROC

Miguel Angel Luque Fernandez
Biomedical Research Institute of Granada (ibs.granada)
Noncommunicable Disease and Cancer Epidemiology
https://maluque.netlify.com

2018 Spanish Stata Conference, 24 October 2018

Contents
1. Cross-validation
2. Cross-validation justification
3. Cross-validation methods
4. cvauroc
5. References

Cross-validation: Definition

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (note: performance = model assessment).

Cross-validation: Applications

Cross-validation can be used to compare the performance of different modeling specifications (e.g., models with and without interactions, inclusion or exclusion of polynomial terms, number of knots with restricted cubic splines, etc.). Furthermore, cross-validation can be used in variable selection and to select the suitable level of flexibility in the model (note: flexibility = model selection).

Cross-validation: Applications

MODEL ASSESSMENT: To compare the performance of different modeling specifications.
MODEL SELECTION: To select the suitable level of flexibility in the model.

MSE: Regression Model

f(x) = f(x1, x2, x3)
Y = β1 x1 + β2 x2 + β3 x3 + ε
Y = f(x) + ε

MSE: Expectation

E(Y | X1 = x1, X2 = x2, X3 = x3)

MSE = E[(Y − f̂(X))² | X = x]

Bias-Variance Trade-off: Error decomposition

MSE = E[(Y − f̂(X))² | X = x] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)

Trade-off: As the flexibility of f̂ increases, its variance increases and its bias decreases.

Bias-Variance Trade-off: Choosing model flexibility

The model's flexibility is chosen based on the average test error:

E[(Y − f̂(X))² | X = x]

And thus, this amounts to a bias-variance trade-off.

Rule: More flexibility increases variance but decreases bias. Less flexibility decreases variance but increases bias.
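The error decomposition can be checked numerically. A minimal Monte Carlo sketch in Python (all data here are simulated; the quadratic truth and the deliberately underfit linear model are assumptions of the sketch, not part of the slides):

```python
import numpy as np

# Monte Carlo check of MSE(x0) = Var(f_hat(x0)) + Bias(f_hat(x0))^2 + Var(eps)
rng = np.random.default_rng(7)
f = lambda x: x ** 2          # assumed true regression function
sigma = 0.5                   # noise sd, so Var(eps) = 0.25
x0 = 1.5                      # point at which the decomposition is evaluated

preds, errs = [], []
for _ in range(5000):
    x = rng.uniform(-2, 2, 50)
    y = f(x) + rng.normal(scale=sigma, size=50)
    coef = np.polyfit(x, y, 1)            # linear fit: underfit on purpose
    fhat0 = np.polyval(coef, x0)
    preds.append(fhat0)
    # squared error against a fresh outcome drawn at x0
    errs.append((f(x0) + rng.normal(scale=sigma) - fhat0) ** 2)

mse = np.mean(errs)
var = np.var(preds)
bias2 = (np.mean(preds) - f(x0)) ** 2
print(mse, var + bias2 + sigma ** 2)  # the two totals should nearly agree
```

The rigid linear fit carries a large squared bias but a small variance; increasing the polynomial degree would shift the balance the other way.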

Bias-Variance trade-off: Regression Function [figure]

Overparameterization

George E. P. Box (1919-2013), quote, 1976: "All models are wrong but some are useful. Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration (...). Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

Justification: AIC and BIC

AIC and BIC are both driven by the maximum likelihood estimate and penalize free parameters to combat overfitting; however, they do so in ways that result in significantly different behavior.

AIC = -2*ln(likelihood) + 2*k, where k = model degrees of freedom.

BIC = -2*ln(likelihood) + ln(N)*k, where k = model degrees of freedom and N = number of observations.

There is some disagreement over the use of AIC and BIC with non-nested models.
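As a quick numeric illustration, both criteria can be computed directly from a model's log-likelihood. A Python sketch (the log-likelihood, parameter count, and sample size below are hypothetical, not taken from the slides):

```python
import numpy as np

def aic(loglik, k):
    """Akaike information criterion: -2*ln(L) + 2*k."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*ln(L) + ln(n)*k."""
    return -2 * loglik + np.log(n) * k

# Hypothetical fitted model: log-likelihood -1250.7, 8 parameters, 4642 obs
ll, k, n = -1250.7, 8, 4642
print(aic(ll, k), bic(ll, k, n))
```

Note that for any n > e² ≈ 7.4, BIC's ln(n)*k penalty exceeds AIC's 2*k, so BIC favors more parsimonious models as the sample grows.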

Justification: Fewer assumptions

Compared with AIC, BIC, and adjusted R², cross-validation provides a direct estimate of the test ERROR and makes fewer assumptions about the true underlying model. Cross-validation can be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the number of predictors in the model.

Cross-validation strategies

Cross-validation options:
Leave-one-out cross-validation (LOOCV).
k-fold cross-validation.
Bootstrapping.

K-fold Cross-validation

A technique widely used for estimating the test error. The estimates can be used to select the best model and to give an idea of the test error of the final chosen model. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K-1 parts (combined), and then obtain predictions for the left-out kth part.

K-fold Cross-validation: test-error estimate

CV(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,   where MSE_k = Σ_{i ∈ C_k} (y_i − ŷ_i)² / n_k

Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
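The K-fold test-error estimate, a weighted average of per-fold MSEs, can be sketched outside Stata in a few lines of Python (simulated data; the linear model and helper callables are assumptions of the sketch):

```python
import numpy as np

def kfold_cv_mse(x, y, fit, predict, K=5, seed=1):
    """Estimate test MSE as CV_(K) = sum_k (n_k / n) * MSE_k."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)          # K roughly equal-sized parts
    cv = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)     # fit on the other K-1 parts
        model = fit(x[train], y[train])
        resid = y[test] - predict(model, x[test])
        cv += (len(test) / len(y)) * np.mean(resid ** 2)
    return cv

# Toy data: y = 2x + noise with noise variance 0.25 (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
fit = lambda xs, ys: np.polyfit(xs, ys, 1)
predict = lambda coef, xs: np.polyval(coef, xs)
print(kfold_cv_mse(x, y, fit, predict, K=10))  # should sit near 0.25
```

With K = len(y), the same function performs LOOCV, at the cost of fitting the model n times.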

Model performance: Internal Validation (AUC)

The AUC is a global summary measure of a diagnostic test's accuracy and discrimination. The greater the AUC, the better the test captures the trade-off between sensitivity (Se) and specificity (Sp) over a continuous range of cut-offs. An important aspect of predictive modeling is the ability of a model to generalize to new cases.

Predictive performance: internal validation

Evaluating the predictive performance (AUC) of a set of independent variables using all cases from the original analysis sample tends to result in an overly optimistic estimate of predictive performance. K-fold cross-validation can be used to generate a more realistic estimate of predictive performance when the number of observations is not very large (LeDell, 2015).

CVAUROC

cvauroc implements k-fold cross-validation of the AUC for a binary outcome after fitting a logistic regression model, and provides the cross-validated fitted probabilities for the dependent variable (outcome), contained in a new variable named fit.

GitHub development version: https://github.com/migariane/cvauroc
Installation from SSC: ssc install cvauroc

cvauroc Stata command syntax

cvauroc depvar varlist [if] [pw] [Kfold] [Seed] [, Cluster(varname) Detail Graph]

Classical AUC estimation

. use http://www.stata-press.com/data/r14/cattaneo2.dta
. gen lbw = cond(bweight<2500,1,0)
. logistic lbw mage medu mmarried prenatal fedu mbsmoke mrace order

Logistic regression                             Number of obs = 4,642

         lbw | Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
        mage |   .9959165    .0140441   -0.29   0.772    .9687674    1.023826
        medu |   .9451338    .0283732   -1.88   0.060    .8911276    1.002413
    mmarried |   .6109995    .1014788   -2.97   0.003    .4412328    .8460849
    prenatal |   .5886787    .073186    -4.26   0.000    .4613759    .7511069
        fedu |   1.040936    .0214226    1.95   0.051    .9997838    1.083782
     mbsmoke |   2.145619    .3055361    5.36   0.000    1.623086    2.836376
       mrace |   .3789501    .057913    -6.35   0.000    .2808648    .5112895
       order |   1.05529     .0605811    0.94   0.349    .9429895    1.180964

. predict fitted, pr
. roctab lbw fitted

              ROC                    -Asymptotic Normal--
    Obs      Area     Std. Err.      [95% Conf. Interval]
  4,642    0.6939      0.0171        0.66041     0.72749

Cross-validated AUC using cvauroc

. cvauroc lbw mage medu mmarried prenatal fedu mbsmoke mrace order, kfold(10) seed(12)

1-fold... 2-fold... 3-fold... 4-fold... 5-fold...
6-fold... 7-fold... 8-fold... 9-fold... 10-fold...

Random seed: 12

              ROC                    -Asymptotic Normal--
    Obs      Area     Std. Err.      [95% Conf. Interval]
  4,642    0.6826      0.0174        0.64842     0.71668
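Outside Stata, the same apparent-versus-cross-validated comparison can be sketched with scikit-learn, assuming it is available; the data below are simulated with a rare-outcome class balance and are not the cattaneo2 birthweight data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Simulated binary-outcome data standing in for the worked example
X, y = make_classification(n_samples=4642, n_features=8, n_informative=4,
                           weights=[0.94, 0.06], random_state=12)

model = LogisticRegression(max_iter=1000)

# Apparent AUC: fit and score on the same sample (optimistic)
apparent = roc_auc_score(y, model.fit(X, y).predict_proba(X)[:, 1])

# Cross-validated AUC: each observation is scored by a model fit without it
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=12)
p = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
cv_auc = roc_auc_score(y, p)

print(round(apparent, 4), round(cv_auc, 4))
```

As in the slides' roctab-versus-cvauroc comparison, the cross-validated AUC is typically somewhat lower than the apparent AUC.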

cvauroc detail and graph options

// Using the detail option to show the table of cutoff values and their
// respective Se, Sp and likelihood-ratio values
. cvauroc lbw mage medu mmarried prenatal1 fedu mbsmoke mrace fbaby, kfold(10) seed(3489) detail

Detailed report of sensitivity and specificity
                                            Correctly
 Cutpoint     Sensitivity   Specificity    Classified       LR+       LR-
( >= .019 )     100.00%         0.00%          6.01%      1.0000
( >= .025 )      99.64%         0.18%          6.16%      0.9982    1.9547
( >= .026 )      99.64%         0.39%          6.36%      1.0003    0.9199
(...) Omitted results
( >= .272 )       1.08%        99.93%         93.99%     15.6389    0.9899
( >= .273 )       0.72%        99.93%         93.97%     10.4259    0.9935
( >= .300 )       0.36%        99.95%         93.97%      7.8181    0.9969

// Using the graph option to display the ROC curve
. cvauroc lbw mage medu mmarried prenatal1 fedu mbsmoke mrace fbaby, kfold(10) seed(3489) graph
. graph export "your_path/figure1.eps", as(eps) preview(off)

cvauroc: Cross-validated AUC

[Figure: cross-validated ROC curve, sensitivity vs 1 − specificity. Area under ROC curve = 0.6559]

Conclusion

Evaluating the predictive performance of a set of independent variables using all cases from the original analysis sample tends to result in an overly optimistic estimate of predictive performance. cvauroc is a user-friendly and helpful k-fold internal cross-validation tool that might be considered when reporting the AUC in observational studies.

References

THANK YOU FOR YOUR TIME