REGRESSION ASSOCIATION VS. PREDICTION

Similar documents
Going Below the Surface Level of a System This lesson plan is an overview of possible uses of the

Evaluation of Accuracy of U.S. DOT Rail-Highway Grade Crossing Accident Prediction Models

PHA Exam 1. Spring 2013

National Assessment in Sweden. A multi-dimensional (ad)venture

Chapter 12 Student Lecture Notes 12-1

Research into the effect of the treatment of the carpal tunnel syndrome with the Phystrac traction device

Dr She Lok, Dr David Greenberg, Barbara Gill, Andrew Murphy, Dr Linda McNamara

Fall 2005 Economics and Econonic Methods Prelim. (Shevchenko, Chair; Biddle, Choi, Iglesias, Martin) Econometrics: Part 4

EXPERIMENT 4 DETERMINATION OF ACCELERATION DUE TO GRAVITY AND NEWTON S SECOND LAW

Evaluation Of Logistic Regression In Classification Of Drug Data In Kwara State

List 3 ways these pictures are the same, and three ways they are different.

Probability, Genetics, and Games

Published in Public Opinion Quarterly, 39, 1975, Monetary Incentives in Mail Surveys

Time Variation of Expected Returns on REITs: Implications for Market. Integration and the Financial Crisis

YOUR VIEWS ABOUT YOUR HIGH BLOOD PRESSURE

AN ANALYSIS OF TELEPHONE MESSAGES: MINIMIZING UNPRODUCTIVE REPLAY TIME

Reliability Demonstration Test Plan

EXPERIMENTAL DRYING OF TOBACCO LEAVES

Optimize Neural Network Controller Design Using Genetic Algorithm

MATH 1300: Finite Mathematics EXAM 1 15 February 2017

Machine Learning Approach to Identifying the Dataset Threshold for the Performance Estimators in Supervised Learning

Cerebrovascular disease Defined from diagnosis ICD10: I60-69, ICD8:

EVALUATION OF DIAGNOSTIC PERFORMANCE USING PARTIAL AREA UNDER THE ROC CURVE. Hua Ma. B.S. Sichuan Normal University, Chengdu, China, 2007

Determinants of Poverty Status among Broiler Farmers in Calabar Metropolis, Cross River State, Nigeria

VIRTUALLY PAINLESS DROSOPHILA GENETICS STANDARDS A, B, C B, C C, C

An Empirical Analysis of Software Productivity

Application of Biomedical Digital Solutions: Virtual Flow Cytometry and Hematometrics in Pathology and Research

econstor Make Your Publications Visible.

Fusarium vs. Soybean

THEORETICAL AND REVIEW ARTICLES A signal detection analysis of the recognition heuristic

Input Techniques for Neural Networks in Stock Market Prediction Ensembles

The Strengths and Limitations of the Statistical Modeling of Complex Social Phenomenon: Focusing on SEM, Path Analysis, or Multiple Regression Models

AN INVESTIGATION INTO THE EXTENT OF NON-COMPLIANCE WITH THE NATIONAL MINIMUM WAGE 1

Combined use of calcipotriol solution (SOp.g/ ml) and Polytar liquid in scalp psoriasis.

Sign up is easy no personal information needed. Reserve your spot today:

Isolating the Impact of Learning Communities and First-Year Residence Halls on First-Year Student Retention and Success

Adaptive Load Balancing: A Study in Multi-Agent. Learning. Abstract

Improving the Surgical Ward Round.

Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times

Difference in Characteristics of Self-Directed Learning Readiness in Students Participating in Learning Communities

Comparison of lower-hybrid (LH) frequency spectra between at the high-field side (HFS) and low-field side (LFS) in Alcator C-Mod

A Returns-Based Representation of Earnings Quality

Design of a Low Noise Amplifier in 0.18µm SiGe BiCMOS Technology

Importance of prostate volume and urinary flow rate in prediction of bladder outlet obstruction in men with symptomatic benign prostatic hyperplasia

Reliability of fovea palatinea in determining the posterior palatal seal

Short Summary on Materials Testing and Analysis

Blind Estimation of Block Interleaver Parameters using Statistical Characteristics

A Study of Factors Contributing to Low Retention Rates

L4-L7 network services in shared network test plan

AGE DETERMINATION FROM RADIOLOGICAL STUDY OF EPIPHYSIAL APPEARANCE AND FUSION AROUND ELBOW JOINT *Dr. S.S. Bhise, **Dr. S. D.

A Comment on Variance Decomposition and Nesting Effects in Two- and Three-Level Designs

A e C l /C d. S j X e. Z i

Vaccines 2013, 1, ; doi: /vaccines Article. Deanna Kruszon-Moran 1, *, R. Monina Klevens 2 and Geraldine M.

PRELIMINARY STUDY ON DISPLACEMENT-BASED DESIGN FOR SEISMIC RETROFIT OF EXISTING BUILDINGS USING TUNED MASS DAMPER

IBM Research Report. A Method of Calculating the Cost of Reducing the Risk Exposure of Non-compliant Process Instances

Female households and poverty: A case study of Faisalabad District

Or-Light Efficiency and Tolerance New-generation intense and pulsed light system

DISCUSSION ON THE TIMEFRAME FOR THE ACHIEVEMENT OF PE14.

Car Taxes and CO 2 emissions in EU. Summary. Introduction. Author: Jørgen Jordal-Jørgensen, COWI

Econometric Analysis of Malnutrition Severity Among Under Five Children in Bangladesh

The Dynamics of Health Disparities; U.S. Mortality Disparities by Education, Richard Miech, Fred Pampel, Jinyoung Kim, and Rick Rogers

PHA Case Study III (Answers)

Percentage of Non-Caucasians in Clinical Trials from 2000 to By Meghanadh Yerram E Special Project

Statistical Techniques For Comparing ACT-R Models of Cognitive Performance

Psychosocial Mechanisms of Smoking Abstinence among Smokers of Low Socioeconomic Status. Yessenia Castro, Ph.D. May 31, 2013

Tutorial Exercises and Case Studies. Design Tutorials

NAMUR Choices of Wine Consumption Measure of Interaction Terms and Attributes

New Methods for Modeling Reliability Using Degradation Data

Cattle Finishing Net Returns in 2017 A Bit Different from a Year Ago Michael Langemeier, Associate Director, Center for Commercial Agriculture

Gentamicin Therapy Induced Functional Type VII Collagen in RDEB Patients Harboring Nonsense Mutations. David T. Woodley and Mei Chen

A Robust R-peak Detection Algorithm using Wavelet Packets

Implementation of a planar coil of wires as a sinusgalvanometer. Analysis of the coil magnetic field

A Numerical Analysis of the Effect of Sampling of Alternatives in Discrete Choice Models

CARAT An Operational Approach to Risk Assessment Definitions, Processes, and Studies

Simulation of Communication Systems

Journal Of Business & Economics Research Volume 1, Number 6

Form. Tick the boxes below to indicate your change(s) of circumstance and complete the relevant sections of this form

San Raffaele Hospital and Emo Centro Cuore Columbus Milan- Italy

Bootstrapped Multinomial Logistic Regression on Apnea Detection Using ECG Data

2 Arrange the following angles in order from smallest to largest. A B C D E F. 3 List the pairs of angles which look to be the same size.

Emerging Subsea Networks

Individual Rationality and Learning: Welfare Expectations in East Germany Post- Reunification

SPREAD OVERREACTION IN INTERNATIONAL BOND MARKETS

Localization Performance of Real and Virtual Sound Sources

Direct and maternal effects on calving ease in heifers and second parity Piemontese cows

Original Article. Effect of performing regular organized recreational activities on the depression level DRAGAN KRIVOKAPIĆ

Home page: Home page: Laugh Zone Activity: Investigation: Investigation: Investigation: Laugh Zone Maths in Action: Activity:

Digital Signal Processing Homework 7 Solutions in progress

Why are Virgin Adolescents Worried about Contracting HIV/AIDS? Evidence from Four Sub-Saharan African Countries

Plan for the implementation of the National Public Procurement Strategy

Bridge Maintenance Survey for Indiana Counties

Thermal Stress Prediction within the Contact Surface during Creep Feed Deep Surface Grinding

Presentation to the Senate Committee on Health & Human Services June 16, The University of Texas Health Science Center at Houston (UTHealth)

P215 Cardiovascular System Blood Vessels Chapter 14 Introduction

Evaluation of Advanced Wind Power Forecasting Models Results of the Anemos Project

How Asset Maintenance Strategy Selection Affects Defect Elimination, Failure Prevention and Equipment Reliability

Towards User-Adaptive Information Visualization

Catriona Crossan Health Economics Research Group (HERG), Brunel University

Eugene Charniak and Eugene Santos Jr. Department of Computer Science Brown University Providence RI and

Breeding new sugarcane clones by mixed models under genotype by environmental interaction

Transcription:

BIOSTATISTICS WORKSHOP: REGRESSION ASSOCIATION VS. PREDICTION Sub-Saharan Africa CFAR mting July 18, 2016 Durban, South Africa Rgrssion what is it good for? Explor Associations Btwn outcoms and xposurs Btwn outcoms and xposurs adjusting for confoundrs Hypothsis Tsting Prdiction Gnrat modls to prdict an outcom or vnt Slct variabls to b includd in prdiction modls Gnrat ruls for disas prdiction It s th solution to all our problms in mdical rsarch 1

Dmntia and Mmory Loss in HIV Explor factors that contribut to mmory loss in HIV+ individuals Crat a prdiction rul for onst of dmntia Cross-sctional Study of n=1000 HIV+ popl Collct information on Scor on mmory tst (continuous: highr is bttr) Dmntia diagnosis (binary) Ag (continuous) Sx (binary) Clinic Siz of houshold (continuous) Tratmnt status (catgorical) Yars sinc diagnosis of HIV (continuous) Outcoms Prdictors and covariats Dmntia and Mmory Loss in HIV Explor factors that contribut to mmory loss in HIV+ individuals Crat a prdiction rul for onst of dmntia Cross-sctional Study of n=1000 HIV+ popl Collct information on Scor on mmory tst (continuous: highr is bttr) Dmntia diagnosis (binary) Ag (continuous) Sx (binary) Clinic Siz of houshold (continuous) Tratmnt status (catgorical) Yars sinc diagnosis of HIV (continuous) Outcoms Prdictors and covariats 2

Rgrssion what is it good for? Explor Associations Btwn outcoms and xposurs Btwn outcoms and xposurs adjusting for confoundrs Hypothsis Tsting Prdiction Gnrat modls to prdict an outcom or vnt Slct variabls to b includd in prdiction modls Gnrat ruls for disas prdiction Dmntia and Mmory Loss in HIV Factors contributing to mmory loss in HIV+ individuals Outcom: Scor on mmory tst Exposur/Covariats Siz of houshold (continuous) Ag (continuous) Sx (binary) Clinic Tratmnt status (catgorical) Yars sinc diagnosis of HIV (continuous) 3

Dmntia and Mmory Loss in HIV Outcom: Continuous Normally distributd Simpl Linar Rgrssion y ˆ ˆ x 1 0 scor ˆ ˆ * siz 0 1 Dmntia and Mmory Loss in HIV Outcom: continuous Linar Rgrssion y ˆ ˆ x 1 0 scor ˆ ˆ * siz 0 1 scor 38.7 0.81* siz Intrprtation? For ach additional prson in houshold, on avrag th mmory scor is 0.81 units highr. 4

Rsults from linar rgrssion β 1 = 0.81, SE(β 1 ) = 0.11, p = 2x10-12 For ach addition prson in a houshold, on avr th scor on a mmory tst is 0.81 units highr. scor 38.7 0.81* siz This association is statistically significant (p=2x10-12 ) R 2 = 0.048 4.8% of th varianc in mmory scors can b xplaind by siz of houshold Linar Rgrssion Factors contributing to mmory loss in HIV+ individuals scor 38.7 0.81* siz Ar thr othr factors lading to mmory loss? What about possibl confoundrs? Ag (continuous) Sx (binary) Clinic (Catgorical) Tratmnt status (catgorical) Yars sinc diagnosis of HIV (continuous) 5

Linar Rgrssion y ˆ ˆ 1x 0 Simpl Linar Rgrssion y y ˆ ˆ x 0 ˆ ˆ 0 1x1... 1 1 ˆ x 2 2 ˆ ˆ x x 3 3 Multipl Linar Rgrssion Association with mmory scor p = 0.02 p < 10-8 p < 10-8 6

From simpl to multipl Outcom: Mmory Scor Primary xposur of intrst: Siz of houshold Covariat: Ag Dos ag confound th rlationship btwn Siz of houshold and Mmory Scor? Ag Siz Mmory Scor Adding Ag Simpl Modl Modl with ag Bta (SE) p-valu Bta (SE) p-valu Siz of houshold 0.81 (0.11) 2x10-12 0.41 (0.12) 0.0005 Ag -0.38 (0.05) 4x10-16 Adjustd R 2 0.047 0.108 Qustions to as: (1) Is siz associatd with mmory scor whn adjusting for ag? (2) Has th association btwn houshold siz and mmory scor changd? 7

Adding Ag scor 38.7 0.81* siz scor 65.5 0.42* siz 0.38* ag Whn controlling for ag th association btwn siz of houshold and scor dcrasd from 0.81 to 0.42 (48%) Rmaind significant (p=0.0005) This suggsts that th rlationship btwn houshold siz and scor is at last partially mdiatd through ag Rul of Thumb: If th β changd by > 10% thn considr it a confoundr or mdiator in th main association From simpl to multipl scor 65.5 0.42* siz 0.38* ag Holding ag constant, for ach additional prson in a houshold, on avrag th mmory scor is 0.42 highr. Among thos with th sam ag, for ach additional prson in a houshold, on avrag th mmory scor is 0.42 highr. 8

Association with mmory scor p = 0.02 p < 10-8 p < 10-8 Adding Sx Simpl Modl Modl with ag Modl with ag & sx Bta (SE) p-valu Bta (SE) p-valu Bta (SE) p-valu Siz of houshold 0.81 (0.11) 2x10-12 0.41 (0.12) 0.0005 0.42 (0.12) 0.0004 Ag -0.38 (0.05) 4x10-16 -0.38 (0.05) 3x10-16 Mal sx 1.29 (0.52) 0.014 Adjustd R 2 0.047 0.108 0.114 Qustions to as: (1) Is siz associatd with mmory scor whn adjusting for ag & sx? (2) Has th association btwn houshold siz and mmory scor changd? 9

Adding Sx scor 65.5 0.42* siz 0.38* ag scor 65.0 0.42* siz 0.38* ag 1.29* mal Whn adding sx to th modl with ag, th association btwn siz of houshold and scor Did not chang (0.42 = 0.42) Houshold siz rmaind significant (p=0.0004) Sx was significantly associatd with mmory scor (p=0.014) Do w lav sx in th modl? Rul of Thumb: If th β changd by > 10% thn considr it a confoundr or mdiator in th main association Adding Clinic Simpl Modl Modl with ag Modl with ag & clinic Bta (SE) p-valu Bta (SE) p-valu Bta (SE) p-valu Siz of houshold 0.81 (0.11) 2x10-12 0.41 (0.12) 0.0005 0.48 (0.12) 0.00007 Ag -0.38 (0.05) 4x10-16 -0.32 (0.05) 4x10-11 Clinic 1 1.0 Clinic 2-1.29 (0.69) 0.061 Clinic 3-2.38 (0.64) 0.0002 Adjustd R 2 0.047 0.108 0.118 Qustions to as: (1) Is siz associatd with mmory scor whn adjusting for ag & clinic? (2) Has th association btwn houshold siz and mmory scor changd? 10

Adding Clinic scor 65.5 0.42* siz 0.38* ag scor 62.8 0.48* siz 0.32* ag 1.29* Clinic2 2.38* Clinic3 Whn adding clinic to th modl with ag, th association btwn siz of houshold and scor Changd (0.42 to 0.48)/0.42 = 14% Houshold siz rmaind significant (p=0.00007) At last on clinic was associatd with houshold siz (ps = 0.06, 0.0002) Do w lav clinic in th modl? Do w lav all 3 clinics in th modl? Rul of Thumb: If th β changd by > 10% thn considr it a confoundr or mdiator in th main association Additional Confoundrs Duration of HIV and Tratmnt did not chang association btwn houshold siz and mmory scor 11

Final Modl scor 62.8 0.48* siz 0.32* ag 1.29* Clinic2 2.38* Clinic3 Largr houshold siz is significantly associatd with highr mmory scor whn controlling for ag and clinic sit. For participants with similar ag and clinic, for ach addition houshold mmbr, thr was a 0.48 highr mmory tst scor. Intrprtation? Choosing Covariats Us nowldg about your outcom, xposur and rsarch qustion to hlp choos covariats to loo at. A priori nowldg! What about automatic slction procdurs? (bacwards, forwards, stpwis) Basd purly on th numbrs (mpirical procss) Usually only loo at significanc NOT usful in hypothsis drivn tsting Do not: Tst all availabl variabls just bcaus you hav thm Includ highly corrlatd variabls (.g. hight & BMI, right hand strngth & lft hand strngth) 12

Rgrssion for hypothsis tsting W did this in Linar Rgrssion, but this procss is th sam for othr rgrssion tchniqus Rmmbr to compar ffct siz! For Logistic, Poisson and Cox Proportional Hazard (Survival Analysis) Rgrssion Compar β stimats For ANOVA Compar diffrnc btwn th group mans adjustd for covariats or Modl as a linar rgrssion (using dummy variabls) and compar β stimats Rgrssion what is it good for? Explor Associations Btwn outcoms and xposurs Btwn outcoms and xposurs adjusting for confoundrs Hypothsis Tsting Prdiction Gnrat modls to prdict an outcom or vnt Slct variabls to b includd in prdiction modls Gnrat ruls for disas prdiction 13

Rgrssion for Prdiction For Association Typically intrstd in singl (or fw) factors that ar associatd with outcom Find othr variabls that confound that association Variabls includd in th modl dpnding on if thy play a rol in primary association of intrst For Prdiction Intrstd in th bst st of variabls for a modl Includ all variabls that improv prdictiv accuracy Lss concrnd with p-valus of spcific variabls (although p-valus ar oftn usd to pars down lists of variabls) Prdictiv Accuracy Discrimination: How wll th modl sparats out cass from controls Rcivr Oprating Charactristic Curv (ROC Curv) Ara Undr th ROC Curv (AUC or c-statistic) Calibration: How wll th prdictd outcom matchs th obsrvd outcom Hosmr-Lmshow Chi-Squar Goodnss-of-fit Statistic R-classification: How wll a nw modl to improvs on an old modl Nt Rclassification Indx (NRI) Intgratd Discrimination Improvmnt (IDI) R-classification Indx 14

Dmntia and Mmory Loss in HIV Explor factors that contribut to mmory loss in HIV+ individuals Crat a prdiction rul for onst of dmntia Cross-sctional Study of n=1000 HIV+ popl Collct information on Dmntia diagnosis (binary) Ag (continuous) Sx (binary) Clinic Tratmnt status (catgorical) Yars sinc diagnosis of HIV (continuous) Outcoms Prdictors and covariats Dmntia and Mmory Loss in HIV Explor factors that contribut to mmory loss in HIV+ individuals Crat a prdiction rul for onst of dmntia (dx_dm) Modl: Logistic Rgrssion pˆ Compar 4 Modls logit ln ˆ 0 ˆ 1 p Modl 1: dx_dm = ag Modl 2: dx_dm = ag + sx Modl 3: dx_dm = ag + sx + HIV_duration ˆ 0 Modl 4: dx_dm = ag + sx + HIV_duration + HIV tratmnt + Clinic pˆ 1 i 1 ˆ 0 i 1 ˆ X i i 1 ˆ X i i i ˆ X i i 15

Rsults Modl 1 Modl 2 Modl 3 Modl 4 Bta (SE) p-valu Bta (SE) p-valu Bta (SE) p-valu Bta (SE) p-valu Ag 0.11 (0.02) 10-13 0.11 (0.02) 10-13 0.12 (0.02) 10-12 0.11 (0.02) 10-10 Sx (mal) -0.15 (0.19) 0.41-0.18 (0.19) 0.36-0.17 (0.19) 0.357 HIV_dur 0.16 (0.03) 10-10 0.16 (0.03) 10-10 HIV trt -0.46 (0.20) 0.02 Clinic 2 0.10 (0.28) 0.72 Clinic 3 0.37 (0.25) 0.14 c - stat 0.669 0.671 0.751 0.760 What to loo for: (1)W ar lss intrstd in p-valus or if th Btas chang (2) W ar looing for th masur of discrimination (c-stat) c-statistic For vry possibl cas/control pair dtrmin if Concordant, π c Discordant, π d Ti, π t pˆ pˆ pˆ ˆ cas p control ˆ cas p control ˆ cas p control c pˆ 1 c t 2 c ˆ 1 d 0 i 1 ˆ 0 ˆ X i 1 t i i ˆ X i i 32 16

ROC Curv How to quantify discrimination ovr all possibl thrsholds c-statistic (concordanc statistic) Rcivr Oprating Charactristic (ROC) Curv Ara Undr Curv (AUC) AUC/c-stat rangs from 0.500 to 1.00 Null = 0.500 Prfct prdictiv modl = 1.00 X-Axis = 1-Spcificity (Fals Positiv Fraction) Y-Axis = Snsitivity (Tru Positiv Fraction) 33 Choos a Thrshold Rmmbr: From a logistic rgrssion you can calculat a prdictd probability (phat) of vnt for vryon pˆ 1 ˆ ˆ 0 i X i i 1 ˆ ˆ 0 i X i i 1 If th modl prdicts th vnt wll, th distribution of phats for vnts (dottd lin) should b to th right of thos for non-vnts (solid lin) A thrshold is whr vryon with phat > thrshold is calld scrn pos and vryon < phat is calld scrn ng You ta your continuous phats, and ma a dichotomous scrn pos/ng variabl 17

Choos a Thrshold pˆ 1 ˆ ˆ 0 i X i i 1 ˆ ˆ 0 i X i i 1 With a phat thrshold = 0.35 If phat > 0.35 => scrn positiv If phat < 0.35 => scrn ngativ Scrn ngativ Scrn positiv Thn compar scrn pos/ng to actual outcom Phat = 0.35 Prdictiv Accuracy Snsitivity (TPF) for scrning tst P(scrn = pos DS=1) = a / (a+c) FPF (1-spcificity) for scrning tst P(scrn = pos DS=0) = b / (b+d) Misclassification Rat = (c+b)/n Scrning Tst (X) Pos (1) Ng (0) Disas(Y) Ys (1) No (0) Total a (TP) c (FN) b (FP) d (TN) a+b c+d total a+c b+d n 36 18

Prdictiv Accuracy Snsitivity (TPF) for scrning tst P(scrn = pos DS=1) = a / (a+c) 20 / 160 = 12.5% FPF (1-spcificity) for scrning tst P(scrn = pos DS=0) = b / (b+d) 33 / 840 = 4% Misclassification Rat = (c+b)/n = (33+140)/1000 = 17.3% Scrning Tst (X) Disas(Y) Ys (1) No (0) Total Pos (1) 20 33 53 Ng (0) 140 807 947 total 160 840 1000 37 Mov th Thrshold pˆ 1 ˆ ˆ 0 i X i i 1 ˆ ˆ 0 i X i i 1 With a phat thrshold = 0.35 If phat > 0.35 => scrn positiv If phat < 0.35 => scrn ngativ Scrn ngativ Scrn positiv Thn compar scrn pos/ng to actual outcom Phat = 0.35 19

Mov th Thrshold pˆ 1 ˆ ˆ 0 i X i i 1 ˆ ˆ 0 i X i i 1 With a phat thrshold = 0.21 If phat > 0.21 => scrn positiv If phat < 0.21 => scrn ngativ Scrn ngativ Scrn positiv Thn compar scrn pos/ng to actual outcom Phat = 0.21 Prdictiv Accuracy Snsitivity (TPF) for scrning tst P(scrn = pos DS=1) = a / (a+c) 67 / 160 = 41.9% FPF (1-spcificity) for scrning tst P(scrn = pos DS=0) = b / (b+d) 130 / 840 = 15% Misclassification Rat = (c+b)/n = (93+130)/1000 = 22.3% Scrning Tst (X) Disas(Y) Ys (1) No (0) Total Pos (1) 67 130 197 Ng (0) 93 710 803 total 160 840 1000 40 20

Mov th Thrshold pˆ 1 ˆ ˆ 0 i X i i 1 ˆ ˆ 0 i X i i 1 Moving th thrshold changs your prdictiv accuracy masurs Moving thrshold highr (right) - Mas it hardr to scrn positiv - Dcrass snsitivity - Dcrass FPF Moving thrshold lowr (lft) - Mas it asir to scrn positiv - Incrass Snsitivity - Incrass FPF Scrn ngativ Scrn positiv ROC Curv Not: - Bst modl appars to b th on with ag, sx and HIV duration (c=0.75) - Tiny improvmnt with addition of trt and clinic (c=0.76), but not statistically diffrnt from prvious modl (p=0.33) - possibly ovrfitting to data - Us ROC to hlp choos variabl list but nd to tst on sparat datast for tru assssmnt of prdictiv accuracy 21

ROC Curv - Good for looing at ovrall discriminatory ability of modls - Uss continuous prdictiv probability - Loos at all possibl thrsholds for prdiction - By changing thrsholds of scrn positiv or scrn ngativ w chang our prdictiv accuracy Rgrssion for Prdiction: Ovrfitting Whn w assss prdicativ accuracy on th sam datast w dvlopd th modl in w ris ovrfitting Ovrfitting Imagin taing a mold of your ft and crating th prfct sho from that mold Th sho will fit grat on you, bst sho you vr had How would it fit your nighbor? Prvnting ovrfitting Sho Sizs! Might los som accuracy, but it is an algorithm that applis to a largr population Idally w hav 2 datasts W dvlop th modl on our data (masur all our ft) [training st] Thn tst it on anothr datast (othr popl s ft) [tsting st] 22

Rgrssion for Prdiction: Ovrfitting Idally w hav 2 datasts W dvlop th modl on our data (masur all our ft) [training st] Thn tst it on anothr datast (othr popl s ft) [tsting st] Can also do cross-validation Choos 10% of data and st asid train th modl in th rmaining 90% tst th modl in th 10% lft out Rpat 10xs and rport th distribution of th rsults (man, SD) Not: This is 10-fold cross-validation 10 is rathr arbitrary do what you sampl siz allows, ma sur thr ar nough vnts/nonvnts in ach st Rgrssion Summary For Association Typically intrstd in singl (or fw) factors that ar associatd with outcom Find othr variabls that confound that association Variabls includd in th modl dpnding on if thy play a rol in primary association of intrst For Prdiction Intrstd in th bst st of variabls for a modl Includ all variabls that improv prdictiv accuracy Lss concrnd with p-valus of spcific variabls (although p-valus ar oftn usd to pars down lists of variabls) B conscious of ovrfitting: always tst in outsid data 23

QUESTIONS? 24