Cross-validation. Miguel Angel Luque Fernandez Faculty of Epidemiology and Population Health Department of Non-communicable Disease.

Similar documents
Cross-validation. Miguel Angel Luque Fernandez Faculty of Epidemiology and Population Health Department of Non-communicable Diseases.

Cross-validated AUC in Stata: CVAUROC

Notes for laboratory session 2

Biostatistics II

1. Objective: analyzing CD4 counts data using GEE marginal model and random effects model. Demonstrate the analysis using SAS and STATA.

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

APPENDIX D REFERENCE AND PREDICTIVE VALUES FOR PEAK EXPIRATORY FLOW RATE (PEFR)

Modeling unobserved heterogeneity in Stata

Multivariate dose-response meta-analysis: an update on glst

Multiple Regression Analysis

Math 215, Lab 7: 5/23/2007

Today. HW 1: due February 4, pm. Matched case-control studies

Supplement for: CD4 cell dynamics in untreated HIV-1 infection: overall rates, and effects of age, viral load, gender and calendar time.

Problem 1) Match the terms to their definitions. Every term is used exactly once. (In the real midterm, there are fewer terms).

MODEL SELECTION STRATEGIES. Tony Panzarella

NORTH SOUTH UNIVERSITY TUTORIAL 2

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Up-to-date survival estimates from prognostic models using temporal recalibration

VARIABLE SELECTION WHEN CONFRONTED WITH MISSING DATA

Dynamic prediction using joint models for recurrent and terminal events: Evolution after a breast cancer

Name: emergency please discuss this with the exam proctor. 6. Vanderbilt s academic honor code applies.

Spatiotemporal models for disease incidence data: a case study

Question 1(25= )

Modern Regression Methods

Overview. All-cause mortality for males with colon cancer and Finnish population. Relative survival

Application of Cox Regression in Modeling Survival Rate of Drug Abuse

Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

RISK PREDICTION MODEL: PENALIZED REGRESSIONS

3. Model evaluation & selection

Multiple Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Ridge regression for risk prediction

The Statistical Analysis of Failure Time Data

How to analyze correlated and longitudinal data?

Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers

Linear Regression in SAS

Workplace Project Portfolio C (WPP C) Laura Rodwell

1.4 - Linear Regression and MS Excel

Linear Regression Analysis

Mendelian Randomization

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Introduction to regression

MS&E 226: Small Data

MEA DISCUSSION PAPERS

Regression Discontinuity Analysis

J2.6 Imputation of missing data with nonlinear relationships

Statistical reports Regression, 2010

Data Analysis in the Health Sciences. Final Exam 2010 EPIB 621

Survival Prediction Models for Estimating the Benefit of Post-Operative Radiation Therapy for Gallbladder Cancer and Lung Cancer

Simple Linear Regression

Final Exam. Thursday the 23rd of March, Question Total Points Score Number Possible Received Total 100

Using dynamic prediction to inform the optimal intervention time for an abdominal aortic aneurysm screening programme

Lecture 14: Adjusting for between- and within-cluster covariates in the analysis of clustered data May 14, 2009

Multi-state survival analysis in Stata

Week 8 Hour 1: More on polynomial fits. The AIC. Hour 2: Dummy Variables what are they? An NHL Example. Hour 3: Interactions. The stepwise method.

Multivariate meta-analysis for non-linear and other multi-parameter associations

Regression analysis of mortality with respect to seasonal influenza in Sweden

Vessel wall differences between middle cerebral artery and basilar artery. plaques on magnetic resonance imaging

This tutorial presentation is prepared by. Mohammad Ehsanul Karim

Proof. Revised. Chapter 12 General and Specific Factors in Selection Modeling Introduction. Bengt Muthén

m 11 m.1 > m 12 m.2 risk for smokers risk for nonsmokers

GENERALIZED ESTIMATING EQUATIONS FOR LONGITUDINAL DATA. Anti-Epileptic Drug Trial Timeline. Exploratory Data Analysis. Exploratory Data Analysis

Poisson regression. Dae-Jin Lee Basque Center for Applied Mathematics.

CHILD HEALTH AND DEVELOPMENT STUDY

Answer all three questions. All questions carry equal marks.

Selection and Combination of Markers for Prediction

A Comparison of Methods for Determining HIV Viral Set Point

Constructing a mixed model using the AIC

Assessing Studies Based on Multiple Regression. Chapter 7. Michael Ash CPPA

CLASSICAL AND. MODERN REGRESSION WITH APPLICATIONS

Supplementary Online Content

Multivariable Cox regression. Day 3: multivariable Cox regression. Presentation of results. The statistical methods section

Performance of Median and Least Squares Regression for Slightly Skewed Data

M15_BERE8380_12_SE_C15.6.qxd 2/21/11 8:21 PM Page Influence Analysis 1

Strategies for handling missing data in randomised trials

Bayesians methods in system identification: equivalences, differences, and misunderstandings

Data Analysis Using Regression and Multilevel/Hierarchical Models

A Comparison of Shape and Scale Estimators of the Two-Parameter Weibull Distribution

Age (continuous) Gender (0=Male, 1=Female) SES (1=Low, 2=Medium, 3=High) Prior Victimization (0= Not Victimized, 1=Victimized)

Meta-analysis using individual participant data: one-stage and two-stage approaches, and why they may differ

What is Regularization? Example by Sean Owen

cloglog link function to transform the (population) hazard probability into a continuous

4. Model evaluation & selection

Daniel Boduszek University of Huddersfield

Division of Biostatistics College of Public Health Qualifying Exam II Part I. 1-5 pm, June 7, 2013 Closed Book

Generalized Estimating Equations for Depression Dose Regimes

Sarwar Islam Mozumder 1, Mark J Rutherford 1 & Paul C Lambert 1, Stata London Users Group Meeting

Midterm STAT-UB.0003 Regression and Forecasting Models. I will not lie, cheat or steal to gain an academic advantage, or tolerate those who do.

The Pretest! Pretest! Pretest! Assignment (Example 2)

Accommodating informative dropout and death: a joint modelling approach for longitudinal and semicompeting risks data

Analysis of Vaccine Effects on Post-Infection Endpoints Biostat 578A Lecture 3

LINEAR REGRESSION FOR BIVARIATE CENSORED DATA VIA MULTIPLE IMPUTATION WEI PAN. School of Public Health. 420 Delaware Street SE

Methods for adjusting survival estimates in the presence of treatment crossover a simulation study

2. Scientific question: Determine whether there is a difference between boys and girls with respect to the distance and its change over time.

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions.

Normal Q Q. Residuals vs Fitted. Standardized residuals. Theoretical Quantiles. Fitted values. Scale Location 26. Residuals vs Leverage

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

Transcription:

Cross-validation Miguel Angel Luque Fernandez Faculty of Epidemiology and Population Health Department of Non-communicable Disease. August 25, 2015 Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 1 / 33

Contents 1 Cross-validation 2 Cross-validation justification 3 Cross-validation methods 4 Examples: Model selection 5 References Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 2 / 33

Cross-validation Definition Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (note: performance = model assessment). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 3 / 33

Cross-validation Definition Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (note: performance = model assessment). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 3 / 33

Cross-validation Definition Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (note: performance = model assessment). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 3 / 33

Cross-validation Applications However, cross-validation can be used to compare the performance of different modeling specifications (i.e. models with and without interactions, inclusion of exclusion of polynomial terms, number of knots with restricted cubic splines, etc). Furthermore, cross-validation can be used in variable selection and select the suitable level of flexibility in the model (note: flexibility = model selection). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 4 / 33

Cross-validation Applications However, cross-validation can be used to compare the performance of different modeling specifications (i.e. models with and without interactions, inclusion of exclusion of polynomial terms, number of knots with restricted cubic splines, etc). Furthermore, cross-validation can be used in variable selection and select the suitable level of flexibility in the model (note: flexibility = model selection). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 4 / 33

Cross-validation Applications However, cross-validation can be used to compare the performance of different modeling specifications (i.e. models with and without interactions, inclusion of exclusion of polynomial terms, number of knots with restricted cubic splines, etc). Furthermore, cross-validation can be used in variable selection and select the suitable level of flexibility in the model (note: flexibility = model selection). Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 4 / 33

Cross-validation Applications MODEL ASSESSMENT: To compare the performance of different modeling specifications. MODEL SELECTION: To select the suitable level of flexibility in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 5 / 33

Cross-validation Applications MODEL ASSESSMENT: To compare the performance of different modeling specifications. MODEL SELECTION: To select the suitable level of flexibility in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 5 / 33

Cross-validation Applications MODEL ASSESSMENT: To compare the performance of different modeling specifications. MODEL SELECTION: To select the suitable level of flexibility in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 5 / 33

MSE Regression Model f (x) = f (x 1 + x 2 + x 3 ) Y = βx 1 + βx 2 + βx 3 + ɛ Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 6 / 33

MSE Regression Model f (x) = f (x 1 + x 2 + x 3 ) Y = βx 1 + βx 2 + βx 3 + ɛ Y = f (x) + ɛ Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 6 / 33

MSE Regression Model f (x) = f (x 1 + x 2 + x 3 ) Y = βx 1 + βx 2 + βx 3 + ɛ Y = f (x) + ɛ Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 6 / 33

MSE Regression Model f (x) = f (x 1 + x 2 + x 3 ) Y = βx 1 + βx 2 + βx 3 + ɛ Y = f (x) + ɛ Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 6 / 33

MSE Expectation E(Y X 1 = x 1, X 2 = x 2, X 3 = x 3 ) MSE E[(Y ˆf (X)) 2 X = x] Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 7 / 33

MSE Expectation E(Y X 1 = x 1, X 2 = x 2, X 3 = x 3 ) MSE E[(Y ˆf (X)) 2 X = x] Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 7 / 33

MSE Expectation E(Y X 1 = x 1, X 2 = x 2, X 3 = x 3 ) MSE E[(Y ˆf (X)) 2 X = x] Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 7 / 33

Bias-Variance Trade-off Error descomposition MSE = E[(Y ˆf (X)) 2 X = x] = Var(ˆf (x 0 )) + [Bias(ˆf (x 0 ))] 2 + Var(ɛ) Trade-off As flexibility of ˆf increases, its variance increases, and its bias decreases. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 8 / 33

BIAS-VARIANCE-TRADE-OFF Bias-variance trade-off Chossing the model flexibility based on average test error Average Test Error E[(Y ˆf (X)) 2 X = x] And thus, this amounts to a bias-variance trade-off. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 9 / 33

BIAS-VARIANCE-TRADE-OFF Bias-variance trade-off Chossing the model flexibility based on average test error Average Test Error E[(Y ˆf (X)) 2 X = x] And thus, this amounts to a bias-variance trade-off. Rule More flexibility increases variance but decreases bias. Less flexibility decreases variance but increases error. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 9 / 33

BIAS-VARIANCE-TRADE-OFF Bias-variance trade-off Chossing the model flexibility based on average test error Average Test Error E[(Y ˆf (X)) 2 X = x] And thus, this amounts to a bias-variance trade-off. Rule More flexibility increases variance but decreases bias. Less flexibility decreases variance but increases error. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 9 / 33

BIAS-VARIANCE-TRADE-OFF Bias-variance trade-off Chossing the model flexibility based on average test error Average Test Error E[(Y ˆf (X)) 2 X = x] And thus, this amounts to a bias-variance trade-off. Rule More flexibility increases variance but decreases bias. Less flexibility decreases variance but increases error. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 9 / 33

Bias-Variance trade-off Regression Function Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 10 / 33

Overparameterization George E.P.Box,(1919-2013) Quote, 1976 All models are wrong but some are useful Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration (...). Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 11 / 33

Overparameterization George E.P.Box,(1919-2013) Quote, 1976 All models are wrong but some are useful Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration (...). Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 11 / 33

Justification AIC and BIC AIC and BIC are both maximum likelihood estimate driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. AIC = -2*ln(likelihood) + 2*k, k = model degrees of freedom Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 12 / 33

Justification AIC and BIC AIC and BIC are both maximum likelihood estimate driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. AIC = -2*ln(likelihood) + 2*k, k = model degrees of freedom BIC = -2*ln(likelihood) + ln(n)*k, k = model degrees of freedom and N = number of observations. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 12 / 33

Justification AIC and BIC AIC and BIC are both maximum likelihood estimate driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. AIC = -2*ln(likelihood) + 2*k, k = model degrees of freedom BIC = -2*ln(likelihood) + ln(n)*k, k = model degrees of freedom and N = number of observations. There is some disagreement over the use of AIC and BIC with non-nested models. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 12 / 33

Justification AIC and BIC AIC and BIC are both maximum likelihood estimate driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. AIC = -2*ln(likelihood) + 2*k, k = model degrees of freedom BIC = -2*ln(likelihood) + ln(n)*k, k = model degrees of freedom and N = number of observations. There is some disagreement over the use of AIC and BIC with non-nested models. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 12 / 33

Justification AIC and BIC AIC and BIC are both maximum likelihood estimate driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. AIC = -2*ln(likelihood) + 2*k, k = model degrees of freedom BIC = -2*ln(likelihood) + ln(n)*k, k = model degrees of freedom and N = number of observations. There is some disagreement over the use of AIC and BIC with non-nested models. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 12 / 33

Justification Fewer assumptions Cross-validation compared with AIC, BIC and adjusted R 2 provides a direct estimate of the ERROR. Cross-validation makes fewer assumptions about the true underlying model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 13 / 33

Justification Fewer assumptions Cross-validation compared with AIC, BIC and adjusted R 2 provides a direct estimate of the ERROR. Cross-validation makes fewer assumptions about the true underlying model. Cross-validation can be used in a wider range of model selections tasks, even in cases where it is hard to pinpoint the number of predictors in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 13 / 33

Justification Fewer assumptions Cross-validation compared with AIC, BIC and adjusted R 2 provides a direct estimate of the ERROR. Cross-validation makes fewer assumptions about the true underlying model. Cross-validation can be used in a wider range of model selections tasks, even in cases where it is hard to pinpoint the number of predictors in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 13 / 33

Justification Fewer assumptions Cross-validation compared with AIC, BIC and adjusted R 2 provides a direct estimate of the ERROR. Cross-validation makes fewer assumptions about the true underlying model. Cross-validation can be used in a wider range of model selections tasks, even in cases where it is hard to pinpoint the number of predictors in the model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 13 / 33

Cross-validation strategies Cross-validation options Leave-one-out cross-validation (LOOCV). k-fold cross validation. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 14 / 33

Cross-validation strategies Cross-validation options Leave-one-out cross-validation (LOOCV). k-fold cross validation. Bootstraping. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 14 / 33

Cross-validation strategies Cross-validation options Leave-one-out cross-validation (LOOCV). k-fold cross validation. Bootstraping. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 14 / 33

Cross-validation strategies Cross-validation options Leave-one-out cross-validation (LOOCV). k-fold cross validation. Bootstraping. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 14 / 33

K-fold Cross-validation K-fold Technique widely used for estimating the test error. Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 15 / 33

K-fold Cross-validation K-fold Technique widely used for estimating the test error. Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model. The idea is to randmoly divide the data into k equal-sized parts. We leave out part k, fit the model to the other k-1 parts (combined), and then obtain predictions for the left-out kth part. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 15 / 33

K-fold Cross-validation K-fold Technique widely used for estimating the test error. Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model. The idea is to randmoly divide the data into k equal-sized parts. We leave out part k, fit the model to the other k-1 parts (combined), and then obtain predictions for the left-out kth part. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 15 / 33

K-fold Cross-validation K-fold Technique widely used for estimating the test error. Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model. The idea is to randmoly divide the data into k equal-sized parts. We leave out part k, fit the model to the other k-1 parts (combined), and then obtain predictions for the left-out kth part. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 15 / 33

K-fold Cross-validation K-fold MSE k CV = k k 1 = n k n MSE k i C k (y i (ŷ i ))/n k Seeting K = n yields n-fold or leave-one-out cross-validation (LOOCV) Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 16 / 33

K-fold Cross-validation K-fold MSE k CV = k k 1 = n k n MSE k i C k (y i (ŷ i ))/n k Seeting K = n yields n-fold or leave-one-out cross-validation (LOOCV) Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 16 / 33

K-fold Cross-validation K-fold MSE k CV = k k 1 = n k n MSE k i C k (y i (ŷ i ))/n k Seeting K = n yields n-fold or leave-one-out cross-validation (LOOCV) Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 16 / 33

Particular case Linear regression and polynomials LOOCV (n) = 1 n ( yi ŷ i n 1 h i h i is the leverage coming from the geometrical interpretation of the residuals in the hat matrix. Where h i,j = cov(ŷ i,y j ) var(y j ) Correlation when K = n Which is equal to the ordinary MSE, except the ith residual is divided by 1-h i. However, with LOOCV the estimates from each fold are highly correlated and hence their average can have high variance. A better choice is a K-fold Cross-Validation with K = 5 or 10. i=1 ) 2 Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 17 / 33

Particular case Linear regression and polynomials LOOCV (n) = 1 n ( yi ŷ i n 1 h i h i is the leverage coming from the geometrical interpretation of the residuals in the hat matrix. Where h i,j = cov(ŷ i,y j ) var(y j ) Correlation when K = n Which is equal to the ordinary MSE, except the ith residual is divided by 1-h i. However, with LOOCV the estimates from each fold are highly correlated and hence their average can have high variance. A better choice is a K-fold Cross-Validation with K = 5 or 10. i=1 ) 2 Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 17 / 33

Particular case Linear regression and polynomials LOOCV (n) = 1 n ( yi ŷ i n 1 h i h i is the leverage coming from the geometrical interpretation of the residuals in the hat matrix. Where h i,j = cov(ŷ i,y j ) var(y j ) Correlation when K = n Which is equal to the ordinary MSE, except the ith residual is divided by 1-h i. However, with LOOCV the estimates from each fold are highly correlated and hence their average can have high variance. A better choice is a K-fold Cross-Validation with K = 5 or 10. i=1 ) 2 Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 17 / 33

Motivation Motivation To investigate the functional form of a continuous variable. In linear regression, and generalized linear models, partial residuals are used to assess whether continuous covariates are in the model in their correct form, or whether a transformation is needed. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 18 / 33

Motivation Motivation To investigate the functional form of a continuous variable. In linear regression, and generalized linear models, partial residuals are used to assess whether continuous covariates are in the model in their correct form, or whether a transformation is needed. In survival models we have to assess is whether the log-hazard is linear in x. We use a scatter plot of martingale residuals againts the variable in assesment. If the scatterplot is linear, this indicates that the log-hazard is linear in x otherwise a transformation of the variable is suitable. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 18 / 33

Motivation Motivation To investigate the functional form of a continuous variable. In linear regression, and generalized linear models, partial residuals are used to assess whether continuous covariates are in the model in their correct form, or whether a transformation is needed. In survival models we have to assess is whether the log-hazard is linear in x. We use a scatter plot of martingale residuals againts the variable in assesment. If the scatterplot is linear, this indicates that the log-hazard is linear in x otherwise a transformation of the variable is suitable. Hosmer, D. W., Lemeshow S., and May, S. (2008). Applied Survival Analysis: Regression Modeling of Time to Event Data. Wiley, 2nd edition, 392 pages. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 18 / 33

Motivation Motivation To investigate the functional form of a continuous variable. In linear regression, and generalized linear models, partial residuals are used to assess whether continuous covariates are in the model in their correct form, or whether a transformation is needed. In survival models we have to assess is whether the log-hazard is linear in x. We use a scatter plot of martingale residuals againts the variable in assesment. If the scatterplot is linear, this indicates that the log-hazard is linear in x otherwise a transformation of the variable is suitable. Hosmer, D. W., Lemeshow S., and May, S. (2008). Applied Survival Analysis: Regression Modeling of Time to Event Data. Wiley, 2nd edition, 392 pages. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 18 / 33

Motivation Martingale residuals Given the ln h(t) = ln h 0 (t) + xβ; The martingale residuals are defined as: ( ) ˆM i = c i Ĥ t i, x i, ˆβ Hosmer, D. W., Lemeshow S., and May, S. (2008). Applied Survival Analysis: Regression Modeling of Time to Event Data. Wiley, 2nd edition, 392 pages. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 19 / 33

The Data [fontsize=\small] variable name type format - sex str1 %9s labels: F(1), M(2) age byte %8.0g tt float %9.0g site str5 %9s labels:ear, Face, Neck, Scalp censor byte %8.0g survival float %9.0g - Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 20 / 33

The Data - _t Haz. Ratio Std. Err. z P> z [95% Conf. Interval] + tt 1.15688.0477728 3.53 0.000 1.066936 1.254406 _Isex_2.0707031.0600307-3.12 0.002.0133882.3733826 _Isite_2.1051597.0869398-2.72 0.006.020803.5315848 _Isite_3.1203827.1145662-2.22 0.026.0186419.7773897 _Isite_4.4360958.3629804-1.00 0.319.0853281 2.228804 _IsexXsit_2_2 13.54933 12.82601 2.75 0.006 2.11913 86.632 _IsexXsit_2_3 15.7232 16.72104 2.59 0.010 1.955778 126.4044 _IsexXsit_2_4 3.322081 3.126073 1.28 0.202.5253284 21.00823 - Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 21 / 33

Male vs Female lincom _Isite_2 + _IsexXsit_2_2 ( 1) _Isite_2 + _IsexXsit_2_2 = 0 _t Haz. Ratio Std. Err. z P> z [95% Conf. Interval] -+ (1) 1.424844.647053 0.78 0.436.5850838 3.469898 Males with face melanomas do not have significantly different death rates to females with face melanomas, of the same thickness.. lincom _Isite_3 + _IsexXsit_2_3, hr ( 1) _Isite_3 + _IsexXsit_2_3 = 0 _t Haz. Ratio Std. Err. z P> z [95% Conf. Interval] -+ (1) 1.892801.8879783 1.36 0.174.7547045 4.74715 Males with scalp melanomas do not have significantly different death rates to females with scalp melanomas, of the same thickness. Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 22 / 33

Kaplan-Meier survival estimates 1.00 0.75 0.50 0.25 0 5 10 15 0.00 20 analysis time Ear/F Ear/M Face/F Face/M Neck/F Neck/M Scalp/F Scalp/M Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 23 / 33

1 1 1 0 0 0-1 -1-1 -2-2 -3 0 5 10 15 tt -2-1 0 1 2 3 lntt -2 0 1 2 3 4 sqtt -3 martingale lowess mg1 tt martingale lowess mg2 lntt martingale lowess mg3 sqtt Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 24 / 33

Model assessment - Variable a b c -+ - tt 1.1568797 _Isex_2.07070305.07496949.07373205 _Isite_2.10515972.10639845.10642456 _Isite_3.12038267.11606331.11908885 _Isite_4.43609584.42994272.43798124 _IsexXsit_~2 13.549333 13.858472 13.546268 _IsexXsit_~3 15.723202 16.500689 15.947097 _IsexXsit_~4 3.3220807 3.2716299 3.2410989 lntt 1.5596707 sqtt 1.7353908 -+ - AIC 699.66119 700.98922 700.37545 BIC 724.00859 725.33662 724.72285 - Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 25 / 33

K-fold cross-validation Stata; K=10 crossfold: xi: streg tt i.sex*i.site, dist(exp) mae k(10) matrix list r(est) matrix a = r(est) matrix list a svmat double a, name(modela) mean modela1 gen modela = modela1 crossfold: xi: streg lntt i.sex*i.site, dist(exp) mae k(10) matrix list r(est) matrix b = r(est) matrix list b svmat double b, name(modelb) mean modelb1 gen modelb = modelb1 crossfold: xi: streg sqtt i.sex*i.site, dist(exp) mae k(10) matrix list r(est) matrix c = r(est) matrix list c svmat double b, name(modelc) mean modelc1 gen modelc = modelc1 mean modela modelb modelc Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 26 / 33

K-fold cross-validation k=10 Mean Std. Err. [95% Conf. Interval] -+ modela 3.411219.2175402 2.919109 3.903329 modelb 3.497022.2103749 3.005649 3.927495 modelc 3.497522.2121749 3.017549 3.977495 Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 27 / 33

Models #fit first degree polynomial equation: fit <- lm(mpg~horsepower,data=auto) #Polynomial degrees fit2 <- lm(mpg~poly(horsepower,2,raw=true), data=auto) fit3 <- lm(mpg~poly(horsepower,3,raw=true), data=auto) fit4 <- lm(mpg~poly(horsepower,4,raw=true), data=auto) fit5 <- lm(mpg~poly(horsepower,5,raw=true), data=auto) #generate range of 50 numbers starting from 30 and ending at 160 plot(mpg~horsepower,data=auto, bty="l") xx <- seq(10,250, length=50) lines(xx, predict(fit, data.frame(horsepower=xx)), col="red") lines(xx, predict(fit2, data.frame(horsepower=xx)), col="green") lines(xx, predict(fit3, data.frame(horsepower=xx)), col="blue") lines(xx, predict(fit4, data.frame(horsepower=xx)), col="black") lines(xx, predict(fit5, data.frame(horsepower=xx)), col="purple") Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 28 / 33

Gareth James: R Dataset Auto Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 29 / 33

10-fold CV Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 30 / 33

References Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 31 / 33

References Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 32 / 33

Thank you THANK YOU FOR YOUR TIME Cancer Survival Group (LSH&TM) Cross-validation August 25, 2015 33 / 33