Multiple Analysis
Daniel Y.T. Fong
NURS8222 Statistical Practice in Health Sciences
School of Nursing, The University of Hong Kong

Some Nomenclature

                                  Dependent/outcome   Independent/explanatory
                                  variables           variables
    Univariate analyses           1
      Simple analysis             1                   1
      Multiple analysis           1                   2 or more
      (/multivariable analysis)
    Multivariate analyses         2 or more           (out of our scope!)

Learning Objectives
1. To understand the possible complications with multiple independent variables
2. To learn methods of selecting predictive factors

A Weight-Lifting Analogy (A Simple Regression Analysis)
- Weight-lifter: the independent variable
- Weight: the outcome data of all subjects
- Floor: the data of the independent variable from all subjects
- A full-length line means the independent variable of the subject was measured; a shortened line means it was NOT measured
Who is More Likely to Win?
- 1 weight-lifter: a simple analysis. 2 weight-lifters: a multiple analysis.
- More weight-lifters (independent variables) make a stronger lift, i.e. a higher R²; fewer make a weaker lift, i.e. a lower R².

[Illustration: a comment from a referee requesting a multiple regression, and the multiple regression reported in the published version of the paper (~ Hepatology, 2002).]

New Complications
Unadjusted and Adjusted Effects of Age

Simple regression analysis (unadjusted effect):
    PF = a + b1 * age (in years)
For 2 subjects whose ages differ by 1 year, their PF will differ by b1, on average.

Multiple regression analysis (adjusted effect):
    PF = a + b1 * age + b2 * gender
For 2 subjects of the same gender whose ages differ by 1 year, their PF will differ by b1, on average. Age provides information on PF in addition to gender.

Multiple Regression - the Ideal
- All independent variables provide additional information to explain the outcome variable
- No missing values in any independent variable
- Independent variables are truly independent of each other
- Result: increased strength (increased R²) and higher precision

Effect of Ginseng
- Setting: multi-center trial
- Subjects: 133 cancer patients
- Outcome: 12-week change (Week 12 - Week 0) of General Health (SF-36)
- Placebo-controlled, with stratified randomization by center

                Ginseng   Placebo   Total
    Center C    31        32        63
    Center D    35        35        70
    Total       66        67        133

[Randomization lists for Centers C and D (IDs C01-C18 and D01-D18, each assigned to Ginseng or Placebo) illustrate the stratified allocation.]

Results:
- Simple regression on Group:   n = 133, R² = 6.8%
- Simple regression on Center:  n = 133, R² = 5.4%
- Multiple regression (Group + Center): n = 133, R² = 12.1%, with higher precision
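The "increased R²" point can be checked with a small self-contained sketch. The data below are hypothetical toy values, not the ginseng trial data: two 0/1-coded predictors stand in for Group and Center, and the two-predictor model attains at least the R² of either simple model.

```python
def ols(predictors, y):
    """Least-squares fit with an intercept, via the normal equations
    (X'X) beta = X'y solved by Gaussian elimination with pivoting."""
    X = [[1.0] + list(row) for row in predictors]
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

def r2(predictors, y, beta):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    fits = [beta[0] + sum(bj * xj for bj, xj in zip(beta[1:], row)) for row in predictors]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fits))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: 'group' and 'center' are 0/1 codes; the outcome
# depends on both, so the joint model explains more variation.
group  = [0, 0, 0, 0, 1, 1, 1, 1]
center = [0, 1, 0, 1, 0, 1, 0, 1]
y      = [0, 1, 0, 1, 2, 3, 2, 3]

r2_group = r2([[g] for g in group], y, ols([[g] for g in group], y))
r2_both  = r2(list(zip(group, center)), y, ols(list(zip(group, center)), y))
print(r2_group, r2_both)   # the joint model explains at least as much
```

In this idealized example the two predictors are balanced and uncorrelated, so each contributes cleanly on top of the other, which is exactly the "ideal" situation on the slide.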
Reality 1: Non-useful Variables
- A variable may provide no additional information to explain the outcome variable.
- Example: Age is not useful here. Adding Age to Group and Center gives n = 133, R² = 13.0%, a trivial gain over 12.1%.

    Model   Independent variable(s)   n     R²
    1       Group                     133   6.8%
    2       Center                    133   5.4%
    3       Age                       133   1.4%
    4       Group, Center             133   12.1%

Reality 2: Missing Values
- Example: Stage has missing values. With Stage added, n = 103 and R² = 40.8%, compared with n = 133 and R² = 12.1% for Group + Center.
- A variable with missing values may decrease the overall power despite giving additional information, because subjects with any missing value are dropped from the analysis.
- A variable with no missing values can effectively give more information.
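Why missing values shrink n: regression software typically performs complete-case (listwise) deletion, so any subject missing a value on any included predictor is dropped. A minimal sketch with hypothetical records (None marks a missing Stage):

```python
# Hypothetical records: (outcome, group, stage); None = missing Stage.
records = [
    (12.0, 1, 2), (8.5, 0, None), (15.0, 1, 3), (3.0, 0, 1),
    (9.0, 1, None), (11.5, 0, 2), (7.0, 0, None), (14.0, 1, 3),
]

# Model without Stage: every record is usable.
n_without_stage = len(records)

# Model with Stage: listwise deletion drops any record missing Stage.
complete_cases = [r for r in records if r[2] is not None]
n_with_stage = len(complete_cases)

print(n_without_stage, n_with_stage)   # adding Stage shrinks the usable sample
```

This mirrors the slide's trade-off: Stage raised R² substantially (12.1% to 40.8%) but cost 30 subjects (133 to 103).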
Reality 3: Multi-collinearity

An awkward result:

    Model   Independent variable(s)   n     R²      P
    1       Group                     133   6.8%    0.020
    2       Center                    133   5.4%    0.002
    5       Compliance                133   8.1%    0.001
    4       Group, Center             133   12.1%   <0.001

- Adding Compliance to Group and Center gives n = 133, R² = 13.6%, yet it becomes insignificant!
- Were Group and Compliance associated? Yes: P < 0.001. (By which test?)
- When data from two independent variables are associated, the data may not allow both variables to be considered simultaneously: only one of them can stay. ("Sorry! I'm also useful!")
- Chinese proverb: one hill cannot shelter two tigers.

Detecting Multi-collinearity
- Variance Inflation Factor (VIF = 1/Tolerance)
- A high VIF, or equivalently a low Tolerance, means a severe multi-collinearity problem.
- There is no defining threshold for "large", though some have suggested a Tolerance below 0.4 as a warning sign.
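Tolerance and VIF for a predictor come from regressing it on the other predictors: Tolerance_j = 1 - R²_j and VIF_j = 1/Tolerance_j. With only two predictors, R²_j is simply their squared Pearson correlation, which allows a compact sketch. The numbers here are hypothetical; x2 is deliberately close to twice x1.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical predictors: x2 is nearly 2 * x1, so they are collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]

r = pearson_r(x1, x2)
tolerance = 1.0 - r ** 2        # = 1 - R^2 from regressing x2 on x1
vif = 1.0 / tolerance           # Variance Inflation Factor

flag = tolerance < 0.4          # one suggested warning threshold
print(round(vif, 1), flag)
```

With more than two predictors, R²_j would instead come from a full multiple regression of predictor j on all the others, but the Tolerance/VIF arithmetic is identical.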
From Insignificant to Significant

For an independent variable, significance can change between a univariable and a multivariable regression. All combinations can occur, and each calls for a possible explanation:

    Univariable regression   Multivariable regression
    Significant              Significant
    Significant              Insignificant
    Insignificant            Significant
    Insignificant            Insignificant

An example contrasting simple analyses with a multiple analysis is given in Altman (1991).

Q & A 1. True or false? In a multivariable analysis,
- It is preferable to consider only independent variables that are significant in their univariable analyses.
- An independent variable may be more likely to be significant than in a univariable analysis.
- An insignificant independent variable is a result of either it being not useful or it having missing values.
Choosing the Best Model

Selecting the Best Predictors - can we use R²?
- R² of a regression model measures the usefulness of the model predictors (e.g. stage, age, center, group, compliance) in predicting/explaining the dependent variable (General Health).
- However, the addition of predictors always increases R².
- This results in having more predictors than needed (over-fitting).

Adjusted R²

    Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1)

where n is the sample size and k is the number of predictors.
- It accounts for the number of model predictors.
- It can be negative.
- It can be reduced by the addition of predictors.
- But even so, we have not exhausted all possible models! Is the chosen model optimal?

Searching through All Possible Models?!
With k predictors there are 2^k possible models:

    2 predictors    4 possible models
    3 predictors    8 possible models
    5 predictors    32 possible models
    10 predictors   1024 possible models

We need a better strategy!
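The adjusted R² formula is easy to check numerically. The sketch below uses the slides' Model 4 (R² = 12.1%, n = 133, k = 2) plus a hypothetical third predictor that lifts R² only to 12.2%: the tiny gain does not pay for the lost degree of freedom, so adjusted R² falls. It also verifies that adjusted R² can be negative and confirms the 2^k model count.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), k = number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Model 4 from the slides: Group + Center.
adj_two = adjusted_r2(0.121, 133, 2)

# Hypothetical third predictor raising R^2 only from 12.1% to 12.2%:
# R^2 went up, but adjusted R^2 goes down.
adj_three = adjusted_r2(0.122, 133, 3)
print(round(adj_two, 4), round(adj_three, 4), adj_three < adj_two)

# Adjusted R^2 can even be negative for a weak model with few subjects.
print(adjusted_r2(0.01, 10, 3) < 0)

# With k candidate predictors there are 2**k possible models.
print(2 ** 10)   # 1024, as on the slide
```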
Automatic Variable Selection Procedures

1. Forward entry
- Start with the empty model, i.e. no independent variables.
- Add the candidate variable that is the most significant when added to the model.
- Repeat until no more candidate variables are significant when added.

2. Backward removal
- Start with the full model, i.e. all independent variables.
- Remove the variable in the model that is the most insignificant.
- Keep doing so until no insignificant variables remain in the model.

3. (Forward) Stepwise
- Start with the empty model, i.e. no independent variables.
- Add the candidate variable that is the most significant when added to the model.
- Remove the variable in the model that is the most insignificant; keep doing so until no insignificant variables remain in the model.
- Go back to the adding step, unless no more candidate variables are significant when added.

Contrasting the Variable Selection Procedures

Forward:
- Starts with the empty model.
- Once a predictor is included, it will never be excluded.
- Remark: easier to implement when compared with the stepwise method.

Backward:
- Starts with the full model.
- Once a predictor is excluded, it will never be included.
- Remark: not desirable when there are many independent variables.

Stepwise:
- Starts with the empty model.
- A predictor can be both included in and excluded from the model.
- Remarks: more flexible, but less easy to implement.
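Forward entry can be sketched in a few lines. Statistical packages add the variable with the smallest entry p-value; as a stand-in, this hypothetical sketch adds whichever candidate most improves R² and stops when the best improvement falls below a threshold. It uses a hand-rolled least-squares fit on toy data in which x3 is pure noise.

```python
def fit_r2(cols, y):
    """R^2 of an OLS fit of y on the given predictor columns (with intercept),
    using the normal equations solved by Gaussian elimination."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(X[0])
    A = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    fits = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fits))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_entry(candidates, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the candidate giving the
    largest R^2 improvement, until no addition gains at least min_gain."""
    selected, best_r2 = [], 0.0
    while True:
        gains = {name: fit_r2([candidates[m] for m in selected + [name]], y) - best_r2
                 for name in candidates if name not in selected}
        if not gains:
            return selected
        name = max(gains, key=gains.get)
        if gains[name] < min_gain:
            return selected
        selected.append(name)
        best_r2 += gains[name]

# Hypothetical toy data: y = 2*x1 + x2 exactly; x3 is unrelated noise.
candidates = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 1, 0, 1, 0, 1, 0, 1],
    "x3": [1, 0, 0, 1, 0, 1, 1, 0],
}
y = [0, 1, 0, 1, 2, 3, 2, 3]
print(forward_entry(candidates, y))   # the noise variable x3 is never entered
```

Note the forward property from the contrast table: once x1 enters, it is never reconsidered, which is exactly why forward selection is simpler but less flexible than stepwise.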
Selecting the Best Predictors?
- The set of best predictors may NOT be unique, especially when there is multi-collinearity.
- An excluded predictor may be significant on its own (in a simple regression), e.g. both group and center are significant factors.
- Selecting predictors based on theoretical principles is often desirable, e.g. one may prefer BMI to height and weight.

A Major Critique of Variable Selection Procedures
- The inflated chance of false positive error, due to multiple testing and the presence of influential observations.

Be Clear about What You Want
1. Variables whose association is to be examined, as determined in advance:
   There is no need to, and one should not, use any automated variable selection procedure here.
2. Variables for prediction:
   Automated variable selection procedures may be used; preferably consider independent or cross-validation of the results.
3. Variables for exploration:
   Be cautious with automated variable selection procedures and perform validation by all means. Preferably incorporate clinical insights on the associations among the variables, especially their causal relationships, and apply some structured regression analysis. Alternatively, one may consider more advanced statistical procedures such as L1 regularization and least angle regression (you will need a friend in statistics!).

Q & A 2. True or false? In a regression analysis where a variable selection procedure is performed,
- All independent variables that are not selected have no effects on the outcome/dependent variable.
- We may only consider independent variables that are significant in their simple analyses.
- R² is always larger than adjusted R².
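The "independent or cross validation" suggested for prediction models can be sketched with a simple holdout split: fit the chosen model on a training portion and check R² on subjects the selection never saw. Everything below is hypothetical toy data; an over-fitted or falsely selected model would show a much lower holdout R² than its training R².

```python
def fit_simple(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def r2_on(x, y, a, b):
    """R^2 of the predictions a + b*x evaluated on (x, y)."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: y is roughly 3 + 0.5*x with a small deterministic wiggle.
x = list(range(1, 11))
y = [3 + 0.5 * xi + (0.1 if xi % 2 else -0.1) for xi in x]

# Holdout validation: fit on the first 7 subjects, evaluate on the last 3.
a, b = fit_simple(x[:7], y[:7])
holdout_r2 = r2_on(x[7:], y[7:], a, b)
print(round(holdout_r2, 3))   # a genuinely predictive model validates well
```

K-fold cross-validation repeats this split K times and averages the holdout performance, which uses the data more efficiently than a single split.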
Effect of Ginseng (continued): Interaction / Effect Modification

12-week change of General Health, mean (SD):

                   Placebo       Ginseng       Difference ± SE (P)
                                               (Ginseng - Placebo)
    Centers C + D  16.1 (14.8)   23.6 (13.0)   7.5 ± 2.4  (P = 0.002)
    Center C       15.4 (9.7)    17.2 (6.2)    1.7 ± 2.1  (P = 0.402)
    Center D       16.7 (18.5)   29.3 (14.7)   12.6 ± 4.0 (P = 0.002)

Is the Ginseng effect different between the two centers?

[Figure: 12-week change of GH by group and center, showing a Ginseng-Placebo difference of 1.7 in Center C and 12.6 in Center D.]

Interaction effect of Group and Center
  = effect of Ginseng in Center D - effect of Ginseng in Center C
  = 12.6 - 1.7 = 10.9
  SE = 4.6 (based on the whole sample), P = 0.021
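The interaction estimate is just a difference of differences, which a tiny sketch makes concrete. It uses the center-specific effects reported on the slide; the helper name `interaction_effect` is ours, not from any statistical package.

```python
def interaction_effect(effect_in_d, effect_in_c):
    """Interaction of Group and Center: the Ginseng effect in Center D
    minus the Ginseng effect in Center C (a difference of differences)."""
    return effect_in_d - effect_in_c

# Center-specific Ginseng-minus-Placebo differences reported on the slide.
effect_center_c = 1.7    # P = 0.402
effect_center_d = 12.6   # P = 0.002

print(round(interaction_effect(effect_center_d, effect_center_c), 1))   # 10.9
```

In a regression this same quantity is the coefficient of the product term group x center, with its SE (4.6 here) estimated from the whole sample.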
Deciding the Best Model - Some Suggestions
- Automated variable selection should not include any interaction effects.
- Examination of interaction effects is often based on clinical interest.
- Remove all insignificant interaction terms before interpreting their corresponding main effects.
- Keep all main effects when their interaction is significant.
- Incorporate clinical theory.
- Check for model validity.

FAQs
1. Will R² be higher even if I add a non-useful independent variable?
2. Will two independent variables with multi-collinearity have an interaction?