BIOSTATISTICS WORKSHOP: REGRESSION
ASSOCIATION VS. PREDICTION
Sub-Saharan Africa CFAR meeting
July 18, 2016
Durban, South Africa

Regression: what is it good for?
- Explore Associations
  - Between outcomes and exposures
  - Between outcomes and exposures, adjusting for confounders
  - Hypothesis Testing
- Prediction
  - Generate models to predict an outcome or event
  - Select variables to be included in prediction models
  - Generate rules for disease prediction
- It's the solution to all our problems in medical research

Dementia and Memory Loss in HIV
- Explore factors that contribute to memory loss in HIV+ individuals
- Create a prediction rule for onset of dementia
- Cross-sectional study of n=1000 HIV+ people
- Collect information on
  Outcomes:
  - Score on memory test (continuous: higher is better)
  - Dementia diagnosis (binary)
  Predictors and covariates:
  - Age (continuous)
  - Sex (binary)
  - Clinic
  - Size of household (continuous)
  - Treatment status (categorical)
  - Years since diagnosis of HIV (continuous)

Regression: what is it good for?
- Explore Associations
  - Between outcomes and exposures
  - Between outcomes and exposures, adjusting for confounders
  - Hypothesis Testing
- Prediction
  - Generate models to predict an outcome or event
  - Select variables to be included in prediction models
  - Generate rules for disease prediction

Dementia and Memory Loss in HIV
Factors contributing to memory loss in HIV+ individuals
- Outcome: Score on memory test
- Exposure/Covariates
  - Size of household (continuous)
  - Age (continuous)
  - Sex (binary)
  - Clinic
  - Treatment status (categorical)
  - Years since diagnosis of HIV (continuous)

Dementia and Memory Loss in HIV
- Outcome: continuous, normally distributed
- Simple Linear Regression
  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
  $\text{score} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{size}$
  $\text{score} = 38.7 + 0.81 \times \text{size}$
- Interpretation? For each additional person in the household, on average the memory score is 0.81 units higher.

Results from linear regression
$\text{score} = 38.7 + 0.81 \times \text{size}$
- $\hat{\beta}_1$ = 0.81, SE($\hat{\beta}_1$) = 0.11, p = 2x10^-12
  - For each additional person in a household, on average the score on a memory test is 0.81 units higher.
  - This association is statistically significant (p = 2x10^-12)
- R^2 = 0.048
  - 4.8% of the variance in memory scores can be explained by size of household

Linear Regression
Factors contributing to memory loss in HIV+ individuals
$\text{score} = 38.7 + 0.81 \times \text{size}$
- Are there other factors leading to memory loss?
- What about possible confounders?
  - Age (continuous)
  - Sex (binary)
  - Clinic (categorical)
  - Treatment status (categorical)
  - Years since diagnosis of HIV (continuous)
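
As a rough illustration of where a slope, an intercept, and an R^2 like those on the slide come from, the sketch below fits a least-squares line in plain Python. The function name `simple_linear_regression` and the toy numbers are assumptions made for illustration only; they are not the workshop data and will not reproduce 38.7, 0.81, or 0.048.

```python
from statistics import mean


def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope: change in y per unit of x
    b0 = y_bar - b1 * x_bar             # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot     # share of variance explained
    return b0, b1, r_squared


# Hypothetical toy data, NOT the workshop dataset.
household_size = [1, 2, 3, 4, 5, 6]
memory_score = [40.0, 39.5, 42.0, 41.0, 43.5, 43.0]

b0, b1, r2 = simple_linear_regression(household_size, memory_score)
print(f"score = {b0:.2f} + {b1:.2f} * size, R^2 = {r2:.3f}")
```

In practice a statistics package would also report the standard error and p-value of the slope; this sketch covers only the point estimates and R^2.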
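
Linear Regression
- Simple Linear Regression
  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
- Multiple Linear Regression
  $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \dots$

Association with memory score
[Figure: plots of association with memory score; reported p-values: p = 0.02, p < 10^-8, p < 10^-8]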

From simple to multiple
- Outcome: Memory Score
- Primary exposure of interest: Size of household
- Covariate: Age
- Does age confound the relationship between Size of household and Memory Score?
[Diagram with nodes: Age, Size, Memory Score]

Adding Age
                    Simple Model             Model with age
                    Beta (SE)     p-value    Beta (SE)      p-value
Size of household   0.81 (0.11)   2x10^-12   0.41 (0.12)    0.0005
Age                                          -0.38 (0.05)   4x10^-16
Adjusted R^2        0.047                    0.108

Questions to ask:
(1) Is size associated with memory score when adjusting for age?
(2) Has the association between household size and memory score changed?

Adding Age
$\text{score} = 38.7 + 0.81 \times \text{size}$
$\text{score} = 65.5 + 0.42 \times \text{size} - 0.38 \times \text{age}$
- When controlling for age, the association between size of household and score
  - Decreased from 0.81 to 0.42 (48%)
  - Remained significant (p = 0.0005)
- This suggests that the relationship between household size and score is at least partially mediated through age
- Rule of Thumb: If the β changed by > 10%, then consider it a confounder or mediator in the main association

From simple to multiple
$\text{score} = 65.5 + 0.42 \times \text{size} - 0.38 \times \text{age}$
- Holding age constant, for each additional person in a household, on average the memory score is 0.42 higher.
- Among those with the same age, for each additional person in a household, on average the memory score is 0.42 higher.
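
The rule of thumb above is a one-line calculation. A minimal sketch follows; the helper names `percent_change` and `exceeds_rule_of_thumb` are hypothetical, and the 10% cutoff is the slide's rule of thumb rather than a formal test.

```python
def percent_change(beta_before, beta_after):
    """Absolute percent change in a coefficient after adding a covariate."""
    return abs(beta_after - beta_before) / abs(beta_before) * 100


def exceeds_rule_of_thumb(beta_before, beta_after, cutoff=10.0):
    """True if the coefficient moved by more than `cutoff` percent."""
    return percent_change(beta_before, beta_after) > cutoff


# Household-size coefficient before and after adding age (values from the slide).
change = percent_change(0.81, 0.42)
print(f"change = {change:.0f}%, flagged = {exceeds_rule_of_thumb(0.81, 0.42)}")
```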

Association with memory score
[Figure: plots of association with memory score; reported p-values: p = 0.02, p < 10^-8, p < 10^-8]

Adding Sex
                    Simple Model             Model with age            Model with age & sex
                    Beta (SE)     p-value    Beta (SE)      p-value    Beta (SE)      p-value
Size of household   0.81 (0.11)   2x10^-12   0.41 (0.12)    0.0005     0.42 (0.12)    0.0004
Age                                          -0.38 (0.05)   4x10^-16   -0.38 (0.05)   3x10^-16
Male sex                                                               1.29 (0.52)    0.014
Adjusted R^2        0.047                    0.108                     0.114

Questions to ask:
(1) Is size associated with memory score when adjusting for age & sex?
(2) Has the association between household size and memory score changed?

Adding Sex
$\text{score} = 65.5 + 0.42 \times \text{size} - 0.38 \times \text{age}$
$\text{score} = 65.0 + 0.42 \times \text{size} - 0.38 \times \text{age} + 1.29 \times \text{male}$
- When adding sex to the model with age, the association between size of household and score
  - Did not change (0.42 = 0.42)
  - Household size remained significant (p = 0.0004)
- Sex was significantly associated with memory score (p = 0.014)
- Do we leave sex in the model?
- Rule of Thumb: If the β changed by > 10%, then consider it a confounder or mediator in the main association

Adding Clinic
                      Simple Model             Model with age            Model with age & clinic
                      Beta (SE)     p-value    Beta (SE)      p-value    Beta (SE)      p-value
Size of household     0.81 (0.11)   2x10^-12   0.41 (0.12)    0.0005     0.48 (0.12)    0.00007
Age                                            -0.38 (0.05)   4x10^-16   -0.32 (0.05)   4x10^-11
Clinic 1 (reference)                                                     1.0
Clinic 2                                                                 -1.29 (0.69)   0.061
Clinic 3                                                                 -2.38 (0.64)   0.0002
Adjusted R^2          0.047                    0.108                     0.118

Questions to ask:
(1) Is size associated with memory score when adjusting for age & clinic?
(2) Has the association between household size and memory score changed?

Adding Clinic
$\text{score} = 65.5 + 0.42 \times \text{size} - 0.38 \times \text{age}$
$\text{score} = 62.8 + 0.48 \times \text{size} - 0.32 \times \text{age} - 1.29 \times \text{Clinic2} - 2.38 \times \text{Clinic3}$
- When adding clinic to the model with age, the association between size of household and score
  - Changed from 0.42 to 0.48: (0.48 - 0.42)/0.42 = 14%
  - Household size remained significant (p = 0.00007)
- At least one clinic was associated with memory score (p-values = 0.06, 0.0002)
- Do we leave clinic in the model? Do we leave all 3 clinics in the model?
- Rule of Thumb: If the β changed by > 10%, then consider it a confounder or mediator in the main association

Additional Confounders
- Duration of HIV and Treatment did not change the association between household size and memory score
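
To make the dummy-variable coding in the clinic equation concrete, the sketch below evaluates that equation for a few hypothetical participants. The coefficients are copied from the slide; the function name `predicted_score` and the example participants are assumptions for illustration.

```python
# Coefficients copied from the slide's model with age and clinic.
COEF = {"intercept": 62.8, "size": 0.48, "age": -0.32,
        "clinic2": -1.29, "clinic3": -2.38}


def predicted_score(size, age, clinic):
    """Predicted memory score; clinic is 1, 2, or 3 (clinic 1 = reference)."""
    if clinic not in (1, 2, 3):
        raise ValueError("clinic must be 1, 2, or 3")
    clinic2 = 1 if clinic == 2 else 0   # dummy variable for clinic 2
    clinic3 = 1 if clinic == 3 else 0   # dummy variable for clinic 3
    return (COEF["intercept"] + COEF["size"] * size + COEF["age"] * age
            + COEF["clinic2"] * clinic2 + COEF["clinic3"] * clinic3)


# Hypothetical participants: same age and household size, different clinics.
for clinic in (1, 2, 3):
    print(clinic, round(predicted_score(size=4, age=40, clinic=clinic), 2))
```

Three clinics need only two dummy variables: clinic 1 is the reference level, so the clinic coefficients are differences from clinic 1 at the same age and household size.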

Final Model
$\text{score} = 62.8 + 0.48 \times \text{size} - 0.32 \times \text{age} - 1.29 \times \text{Clinic2} - 2.38 \times \text{Clinic3}$
Interpretation?
- Larger household size is significantly associated with higher memory score when controlling for age and clinic site.
- For participants with similar age and clinic, each additional household member was associated with a 0.48 higher memory test score.

Choosing Covariates
- Use knowledge about your outcome, exposure and research question to help choose covariates to look at. A priori knowledge!
- What about automatic selection procedures? (backwards, forwards, stepwise)
  - Based purely on the numbers (empirical process)
  - Usually only look at significance
  - NOT useful in hypothesis-driven testing
- Do not:
  - Test all available variables just because you have them
  - Include highly correlated variables (e.g. height & BMI, right hand strength & left hand strength)

Regression for hypothesis testing
- We did this in Linear Regression, but this process is the same for other regression techniques
- Remember to compare effect size!
  - For Logistic, Poisson and Cox Proportional Hazard (Survival Analysis) Regression
    - Compare β estimates
  - For ANOVA
    - Compare the difference between the group means adjusted for covariates, or
    - Model as a linear regression (using dummy variables) and compare β estimates

Regression: what is it good for?
- Explore Associations
  - Between outcomes and exposures
  - Between outcomes and exposures, adjusting for confounders
  - Hypothesis Testing
- Prediction
  - Generate models to predict an outcome or event
  - Select variables to be included in prediction models
  - Generate rules for disease prediction

Regression for Prediction
- For Association
  - Typically interested in a single factor (or a few factors) associated with the outcome
  - Find other variables that confound that association
  - Variables are included in the model depending on whether they play a role in the primary association of interest
- For Prediction
  - Interested in the best set of variables for a model
  - Include all variables that improve predictive accuracy
  - Less concerned with p-values of specific variables (although p-values are often used to pare down lists of variables)

Predictive Accuracy
- Discrimination: How well the model separates cases from controls
  - Receiver Operating Characteristic Curve (ROC Curve)
  - Area Under the ROC Curve (AUC or c-statistic)
- Calibration: How well the predicted outcome matches the observed outcome
  - Hosmer-Lemeshow Chi-Square Goodness-of-fit Statistic
- Re-classification: How well a new model improves on an old model
  - Net Reclassification Index (NRI)
  - Integrated Discrimination Improvement (IDI)
  - Re-classification Index

Dementia and Memory Loss in HIV
- Explore factors that contribute to memory loss in HIV+ individuals
- Create a prediction rule for onset of dementia
- Cross-sectional study of n=1000 HIV+ people
- Collect information on
  Outcomes:
  - Dementia diagnosis (binary)
  Predictors and covariates:
  - Age (continuous)
  - Sex (binary)
  - Clinic
  - Treatment status (categorical)
  - Years since diagnosis of HIV (continuous)

Dementia and Memory Loss in HIV
- Explore factors that contribute to memory loss in HIV+ individuals
- Create a prediction rule for onset of dementia (dx_dem)
- Model: Logistic Regression
  $\text{logit}(\hat{p}) = \ln\left(\dfrac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \sum_i \hat{\beta}_i X_i$
  $\hat{p} = \dfrac{\exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}{1 + \exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}$
- Compare 4 Models
  - Model 1: dx_dem = age
  - Model 2: dx_dem = age + sex
  - Model 3: dx_dem = age + sex + HIV_duration
  - Model 4: dx_dem = age + sex + HIV_duration + HIV treatment + Clinic
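
The link between the linear predictor and the predicted probability can be sketched in a few lines. The function names and the coefficient values below are hypothetical (the slides do not report intercepts), so the printed number is illustrative only. Note that 1 / (1 + exp(-eta)) is algebraically the same as exp(eta) / (1 + exp(eta)).

```python
import math


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Map a linear predictor (log-odds) back to a probability."""
    return 1 / (1 + math.exp(-eta))


def predicted_probability(intercept, coefs, x):
    """p-hat for one person given coefficients and covariate values."""
    eta = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return inv_logit(eta)


# Hypothetical coefficients for a one-covariate model (dx_dem = age);
# these are NOT fitted values from the workshop data.
p_hat = predicted_probability(intercept=-6.0, coefs=[0.1], x=[45])
print(f"p-hat = {p_hat:.3f}")
```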

Results
            Model 1                 Model 2                 Model 3                 Model 4
            Beta (SE)      p-value  Beta (SE)      p-value  Beta (SE)      p-value  Beta (SE)      p-value
Age         0.11 (0.02)    10^-13   0.11 (0.02)    10^-13   0.12 (0.02)    10^-12   0.11 (0.02)    10^-10
Sex (male)                          -0.15 (0.19)   0.41     -0.18 (0.19)   0.36     -0.17 (0.19)   0.357
HIV_dur                                                     0.16 (0.03)    10^-10   0.16 (0.03)    10^-10
HIV trt                                                                             -0.46 (0.20)   0.02
Clinic 2                                                                            0.10 (0.28)    0.72
Clinic 3                                                                            0.37 (0.25)    0.14
c-stat      0.669                   0.671                   0.751                   0.760

What to look for:
(1) We are less interested in p-values or whether the Betas change
(2) We are looking for the measure of discrimination (c-stat)

c-statistic
For every possible case/control pair, determine if it is
- Concordant, $\pi_c$: $\hat{p}_{case} > \hat{p}_{control}$
- Discordant, $\pi_d$: $\hat{p}_{case} < \hat{p}_{control}$
- Tie, $\pi_t$: $\hat{p}_{case} = \hat{p}_{control}$
$c = \pi_c + \frac{1}{2}\pi_t$, with $\pi_c + \pi_d + \pi_t = 1$
$\hat{p} = \dfrac{\exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}{1 + \exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}$
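
The pairwise definition of the c-statistic can be sketched directly: compare every case with every control and count concordant pairs, giving ties half credit (the usual convention). The function name `c_statistic` and the toy predicted probabilities are assumptions for illustration, not workshop data.

```python
def c_statistic(p_hat, outcome):
    """Share of case/control pairs where the case has the higher p-hat.

    Ties count as half a concordant pair.
    """
    cases = [p for p, y in zip(p_hat, outcome) if y == 1]
    controls = [p for p, y in zip(p_hat, outcome) if y == 0]
    concordant = ties = 0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1
            elif p_case == p_control:
                ties += 1
    n_pairs = len(cases) * len(controls)
    return (concordant + 0.5 * ties) / n_pairs


# Hypothetical predicted probabilities and outcomes (1 = case, 0 = control).
p_hat = [0.9, 0.8, 0.3, 0.2]
outcome = [1, 0, 1, 0]
print(c_statistic(p_hat, outcome))
```

The double loop is quadratic in the number of subjects, which is fine for a teaching example; statistical packages compute the same quantity more efficiently using ranks.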

ROC Curve
How to quantify discrimination over all possible thresholds
- c-statistic (concordance statistic)
- Receiver Operating Characteristic (ROC) Curve
- Area Under Curve (AUC)
- AUC/c-stat ranges from 0.500 to 1.00
  - Null = 0.500
  - Perfect predictive model = 1.00
- X-Axis = 1 - Specificity (False Positive Fraction)
- Y-Axis = Sensitivity (True Positive Fraction)

Choose a Threshold
- Remember: From a logistic regression you can calculate a predicted probability (phat) of the event for everyone
  $\hat{p} = \dfrac{\exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}{1 + \exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}$
- If the model predicts the event well, the distribution of phats for events (dotted line) should be to the right of that for non-events (solid line)
- A threshold is a cut point: everyone with phat > threshold is called screen pos and everyone with phat < threshold is called screen neg
- You take your continuous phats and make a dichotomous screen pos/neg variable

Choose a Threshold
$\hat{p} = \dfrac{\exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}{1 + \exp(\hat{\beta}_0 + \sum_i \hat{\beta}_i X_i)}$
- With a phat threshold = 0.35
  - If phat > 0.35 => screen positive
  - If phat < 0.35 => screen negative
- Then compare screen pos/neg to the actual outcome
[Figure: distributions of phat with the threshold at phat = 0.35; screen negative to the left, screen positive to the right]

Predictive Accuracy
- Sensitivity (TPF) for screening test
  P(screen = pos | DS = 1) = a / (a+c)
- FPF (1 - specificity) for screening test
  P(screen = pos | DS = 0) = b / (b+d)
- Misclassification Rate = (c+b)/n

                      Disease (Y)
Screening Test (X)    Yes (1)    No (0)    Total
Pos (1)               a (TP)     b (FP)    a+b
Neg (0)               c (FN)     d (TN)    c+d
Total                 a+c        b+d       n
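
The step from continuous phats to the 2x2 table can be sketched as follows. The function name `two_by_two` and the toy data are assumptions for illustration; the slides do not say how a phat exactly equal to the threshold is handled, so this sketch treats it as screen negative.

```python
def two_by_two(p_hat, outcome, threshold):
    """Counts (a, b, c, d) = (TP, FP, FN, TN) at a given phat threshold."""
    a = b = c = d = 0
    for p, y in zip(p_hat, outcome):
        screen_pos = p > threshold      # the slide's rule: phat > threshold
        if screen_pos and y == 1:
            a += 1                      # true positive
        elif screen_pos and y == 0:
            b += 1                      # false positive
        elif y == 1:
            c += 1                      # false negative
        else:
            d += 1                      # true negative
    return a, b, c, d


# Hypothetical predicted probabilities and outcomes (1 = disease, 0 = none).
p_hat = [0.9, 0.8, 0.3, 0.2]
outcome = [1, 0, 1, 0]
print(two_by_two(p_hat, outcome, threshold=0.35))
```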

Predictive Accuracy
- Sensitivity (TPF) for screening test
  P(screen = pos | DS = 1) = a / (a+c) = 20 / 160 = 12.5%
- FPF (1 - specificity) for screening test
  P(screen = pos | DS = 0) = b / (b+d) = 33 / 840 = 4%
- Misclassification Rate = (c+b)/n = (33+140)/1000 = 17.3%

                      Disease (Y)
Screening Test (X)    Yes (1)    No (0)    Total
Pos (1)               20         33        53
Neg (0)               140        807       947
Total                 160        840       1000

Move the Threshold
- With a phat threshold = 0.35
  - If phat > 0.35 => screen positive
  - If phat < 0.35 => screen negative
- Then compare screen pos/neg to the actual outcome
[Figure: distributions of phat with the threshold at phat = 0.35; screen negative to the left, screen positive to the right]

Move the Threshold
- With a phat threshold = 0.21
  - If phat > 0.21 => screen positive
  - If phat < 0.21 => screen negative
- Then compare screen pos/neg to the actual outcome
[Figure: distributions of phat with the threshold at phat = 0.21; screen negative to the left, screen positive to the right]

Predictive Accuracy
- Sensitivity (TPF) for screening test
  P(screen = pos | DS = 1) = a / (a+c) = 67 / 160 = 41.9%
- FPF (1 - specificity) for screening test
  P(screen = pos | DS = 0) = b / (b+d) = 130 / 840 = 15%
- Misclassification Rate = (c+b)/n = (93+130)/1000 = 22.3%

                      Disease (Y)
Screening Test (X)    Yes (1)    No (0)    Total
Pos (1)               67         130       197
Neg (0)               93         710       803
Total                 160        840       1000
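
The arithmetic on the two Predictive Accuracy slides can be checked with a small helper. The cell counts below are taken from the slides' 2x2 tables at thresholds 0.35 and 0.21; the function name `screening_metrics` is a hypothetical choice.

```python
def screening_metrics(a, b, c, d):
    """Accuracy measures from a 2x2 table: a=TP, b=FP, c=FN, d=TN."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),         # true positive fraction
        "fpf": b / (b + d),                 # false positive fraction
        "misclassification": (b + c) / n,   # share of wrong screen results
    }


# Cell counts (a, b, c, d) from the slides, keyed by phat threshold.
tables = {0.35: (20, 33, 140, 807), 0.21: (67, 130, 93, 710)}
for threshold, cells in tables.items():
    metrics = screening_metrics(*cells)
    print(threshold, {name: f"{value:.1%}" for name, value in metrics.items()})
```

Lowering the threshold from 0.35 to 0.21 raises sensitivity and the false positive fraction together, which is the trade-off the ROC curve traces out.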

Move the Threshold
Moving the threshold changes your predictive accuracy measures
- Moving the threshold higher (right)
  - Makes it harder to screen positive
  - Decreases sensitivity
  - Decreases FPF
- Moving the threshold lower (left)
  - Makes it easier to screen positive
  - Increases sensitivity
  - Increases FPF
[Figure: distributions of phat; screen negative to the left of the threshold, screen positive to the right]

ROC Curve
Note:
- Best model appears to be the one with age, sex and HIV duration (c=0.75)
- Tiny improvement with addition of trt and clinic (c=0.76), but not statistically different from previous model (p=0.33)
  - possibly overfitting to data
- Use ROC to help choose the variable list, but need to test on a separate dataset for a true assessment of predictive accuracy
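
Sweeping the threshold over every observed phat gives the points of the ROC curve, and the trapezoid rule gives the area under it. The sketch below is a minimal version under stated assumptions: the function names and toy data are hypothetical, and it uses phat >= threshold (rather than the slides' strict >) so that the lowest threshold classifies everyone as screen positive and the curve reaches the top-right corner.

```python
def roc_points(p_hat, outcome):
    """(FPF, TPF) pairs as the threshold moves from high to low."""
    n_pos = sum(outcome)
    n_neg = len(outcome) - n_pos
    points = [(0.0, 0.0)]               # threshold above every phat
    for t in sorted(set(p_hat), reverse=True):
        tp = sum(1 for p, y in zip(p_hat, outcome) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(p_hat, outcome) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities and outcomes (1 = event, 0 = non-event).
p_hat = [0.9, 0.8, 0.3, 0.2]
outcome = [1, 0, 1, 0]
curve = roc_points(p_hat, outcome)
print(curve)
print(auc(curve))
```

Computed this way, the AUC matches the pairwise c-statistic with ties counted as one half, which is why the slides use the two terms interchangeably.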

ROC Curve
- Good for looking at overall discriminatory ability of models
- Uses continuous predictive probability
- Looks at all possible thresholds for prediction
- By changing thresholds of screen positive or screen negative we change our predictive accuracy

Regression for Prediction: Overfitting
- When we assess predictive accuracy on the same dataset we developed the model in, we risk overfitting
- Overfitting
  - Imagine taking a mold of your feet and creating the perfect shoe from that mold
  - The shoe will fit great on you, best shoe you ever had
  - How would it fit your neighbor?
- Preventing overfitting
  - Shoe Sizes!
  - Might lose some accuracy, but it is an algorithm that applies to a larger population
- Ideally we have 2 datasets
  - We develop the model on our data (measure all our feet) [training set]
  - Then test it on another dataset (other people's feet) [testing set]

Regression for Prediction: Overfitting
- Ideally we have 2 datasets
  - We develop the model on our data (measure all our feet) [training set]
  - Then test it on another dataset (other people's feet) [testing set]
- Can also do cross-validation
  - Choose 10% of the data and set it aside
  - Train the model in the remaining 90%
  - Test the model in the 10% left out
  - Repeat 10 times and report the distribution of the results (mean, SD)
- Note: This is 10-fold cross-validation
  - 10 is rather arbitrary; do what your sample size allows, and make sure there are enough events/non-events in each set

Regression Summary
- For Association
  - Typically interested in a single factor (or a few factors) associated with the outcome
  - Find other variables that confound that association
  - Variables are included in the model depending on whether they play a role in the primary association of interest
- For Prediction
  - Interested in the best set of variables for a model
  - Include all variables that improve predictive accuracy
  - Less concerned with p-values of specific variables (although p-values are often used to pare down lists of variables)
  - Be conscious of overfitting: always test in outside data
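
The cross-validation recipe above can be sketched in plain Python. Everything named here (`k_fold_indices`, `cross_validate`, `fit_line`, `mean_squared_error`) is a hypothetical helper, the data are simulated, and a simple least-squares line with mean squared error stands in for whatever model and accuracy measure are actually of interest. The sketch splits at random and does not stratify, so it does not by itself guarantee enough events and non-events in each fold.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, fit, evaluate, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat for every fold."""
    results = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        results.append(
            evaluate(model, [x[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return mean(results), stdev(results)


def fit_line(x, y):
    """Least-squares line; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope


def mean_squared_error(model, x, y):
    """Average squared prediction error of a fitted line on held-out data."""
    intercept, slope = model
    return mean([(b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)])


# Simulated toy data: a noisy line, NOT the workshop dataset.
rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
cv_mean, cv_sd = cross_validate(x, y, fit_line, mean_squared_error, k=10)
print(f"10-fold CV mean squared error: {cv_mean:.2f} (SD {cv_sd:.2f})")
```

For a binary outcome such as dementia, `fit` would be a logistic regression and `evaluate` a discrimination measure such as the c-statistic; the looping structure stays the same.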

QUESTIONS?