Quasicomplete Separation in Logistic Regression: A Medical Example


Madeline J. Boyle, Carolinas Medical Center, Charlotte, NC

ABSTRACT

Logistic regression can be used to model the relationship between a dichotomous outcome variable and explanatory variables that can be either dichotomous or continuous. When using the LOGISTIC procedure in the SAS/STAT software, one problem that can arise is complete or quasicomplete separation of the data points. An example from a blunt intestinal injury study completed at a major metropolitan hospital in the southeast is presented, together with the quasicomplete separation found in the data and the steps taken in an attempt to remedy the problem.

INTRODUCTION

The LOGISTIC procedure in the SAS/STAT software is useful for analyzing a binary response variable, that is, a response that takes one of two values, denoted by zero and one. For example, if the characteristic of interest is disease, then Y=0 if the disease is not present and Y=1 if the disease is present.

When performing logistic regression, possible hindrances to the analysis arise from the data themselves. The existence of maximum likelihood estimates for the parameters of the logistic model depends on the configuration of the sample points in the observation space. If no finite maximum likelihood estimates exist, then you have the situation described here as "infinite parameters". The sample points can fall into "three mutually exclusive and exhaustive categories: complete separation, quasicomplete separation, and overlap" (So, 1993). These three categories are discussed briefly, possible remedies for separation are given, and finally a medical example of quasicomplete separation is presented, along with the attempted solution.

INFINITE PARAMETERS

Infinite parameters refers to the situation in which no finite maximum likelihood estimate exists, as can occur for a logistic regression model. As mentioned before, the existence of these estimates depends on the configuration of the sample points in the observation space. The three types of configurations are complete separation, quasicomplete separation, and overlap. "A Tutorial on Logistic Regression" (So, 1993) gives more information on how infinite parameters arise in each of these configurations. A brief description of the configurations and possible remedies is found in Logistic Regression Examples Using the SAS System (SAS Institute Inc., 1995) and is summarized below.

Complete Separation

If a complete separation exists in the sample points, then the maximum likelihood estimate does not exist. In this case there exists a vector of pseudoestimates that correctly allocates all observations to their observed response groups. Such a data configuration gives an infinite set of nonunique estimates. At each iteration, the predicted probability that each observation belongs to its observed response group rapidly grows to one, and the log likelihood diminishes to zero.

Quasicomplete Separation

If a quasicomplete separation exists in the sample points, then the maximum likelihood estimate does not exist. The data are not completely separated, and a vector of pseudoestimates correctly allocates all but a nonempty set of observations to their response groups. Such a data configuration also gives an infinite set of nonunique estimates. Unlike the case of complete separation, the log likelihood does not diminish to zero as the iterations proceed. This is the separation that exists in the medical example discussed later.

Figure 1 displays the quasicomplete separation in the data used by So (1993). From the figure one can easily see that the two groups cannot be completely separated by a straight line.

Figure 1. Plot of Points Causing Quasicomplete Separation

The line shown on the graph illustrates the quasicomplete separation of the two groups. If the value of "x" for the point on the line from group number two were changed to any lower number (e.g., fifty), then a case of complete separation would exist. In this example, however, since the data are not completely separated and at least one member from each group lies on the line, quasicomplete separation exists.

Overlap

An overlap of the sample points exists when neither a complete nor a quasicomplete separation of the data exists. If there is overlap in the sample points, then the maximum likelihood estimate exists and is unique. Figure 2 displays the overlap of the data points in the data used by So (1993). Every straight line that can be drawn on this graph has a sample point from each of the two groups on the same side of the line; therefore there is overlap of the data.

Figure 2. Plot of Points Showing Overlap of Data Points

Remedies

If there is complete or quasicomplete separation in your data, the maximum likelihood estimates do not exist, and although version 6 of the SAS system will continue to run, the statistics from the model may not be valid. Various remedies are available. First, examine the original raw data for errors and, if any are found, correct them and repeat the analysis to see whether the separation still exists. If this does not work, there are some options involving the data: 1) categorize quantitative variables, 2) use fewer or different explanatory variables, or 3) collect more data. "With increasing sample size, the probability of observing a set of separated data points tends to zero, no matter what the sample scheme" (Albert and Anderson, 1984).

The model may also be altered to remedy the separation. Try reclassifying the response variable or, in a model that uses variable selection, try a different selection method: if you encounter complete separation with backward elimination, for example, try forward or stepwise selection instead. Complete and quasicomplete separation usually occur with small samples and qualitative data.
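The three configurations described above can be checked mechanically when the model has an intercept and a single explanatory variable: the groups are completely separated if one group's values all lie strictly below the other's, quasicompletely separated if the two groups meet only at a shared cut point, and overlapping otherwise. The following is a minimal Python sketch under that single-predictor assumption. The function name classify_separation and the toy data are hypothetical and are not part of the paper or of PROC LOGISTIC; with several explanatory variables the same question generally has to be posed as a linear programming problem instead.

```python
def classify_separation(x, y):
    """Classify one-predictor data as 'complete', 'quasicomplete', or 'overlap'.

    x: values of a single explanatory variable
    y: binary responses coded 0/1
    Assumes a logistic model with an intercept and this one predictor.
    """
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both response groups must be present")
    if len(set(x)) < 2:
        raise ValueError("the explanatory variable must take at least two values")

    # Complete separation: a cut point puts every observation strictly
    # on the side of its own response group.
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    # Quasicomplete separation: the groups touch only at the cut point,
    # so at least one observation from each group lies on the boundary.
    if max(x0) <= min(x1) or max(x1) <= min(x0):
        return "quasicomplete"
    # Otherwise the groups interleave and finite estimates exist.
    return "overlap"


# Hypothetical toy data, not taken from the paper.
print(classify_separation([1, 2, 3, 4], [0, 0, 1, 1]))        # complete
print(classify_separation([1, 2, 3, 3, 4], [0, 0, 0, 1, 1]))  # quasicomplete
print(classify_separation([1, 2, 3, 4], [0, 1, 0, 1]))        # overlap
```

A check of this kind is only a screening aid; the warning printed by the fitting procedure remains the authoritative signal for the model actually being fitted.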

However, complete or quasicomplete separation can occur for any type of data or sample size. An important point to keep in mind is that the more explanatory variables your model contains, the greater the likelihood of encountering complete or quasicomplete separation.

MEDICAL EXAMPLE

A retrospective study over six years of patients with blunt intestinal injury was completed at the Carolinas Medical Center in Charlotte, NC. The study objective was to identify factors associated with a delay of more than six hours between the time of injury and therapeutic laparotomy. The statistical analysis included stepwise logistic regression to determine whether a set of explanatory variables could predict each of three outcomes: delayed laparotomy, having a life-threatening injury, and the location of injury (small bowel or colon). The analyses were completed using the SAS system for Windows, version 6.

The original explanatory set of thirty-three variables contained categorical, dichotomous, and continuous variables. These variables included mechanism of injury, abdominal exam results, fractures, Computerized Tomography (CT) exam results, blood alcohol level, Diagnostic Peritoneal Lavage (DPL) exam results, and hypotensive status. The sample included sixty-one patients who were confirmed by laparotomy to have sustained blunt intestinal injury, thirty of whom had a laparotomy more than six hours post injury.

An obvious drawback to a stepwise logistic regression with such a small sample and so many explanatory variables is that the resulting model is unlikely to replicate in another set of patients. A rule of thumb proposed by Harrell et al. (1985) is that "one should not attempt a stepwise regression when there are fewer than ten times as many events in the training sample as there are candidate predictor variables". When the response variable is binary, the limiting sample size is the size of the less frequent response category. In this example this is the thirty patients with a delayed laparotomy, so by Harrell's rule of thumb only three explanatory variables (30 / 10 = 3) could be introduced into the model. However, since the focus of this example is on the quasicomplete separation of the data, and because the data were collected over a six-year period, we will not comment further on the sample size.

When we attempted to use the LOGISTIC procedure on our model for delayed laparotomy with thirty-three candidate explanatory variables, an intercept and three other variables were entered into the model, and then a warning was printed in the output stating that a quasicomplete separation of the data existed and that the maximum likelihood estimates did not exist. At this point the procedure continued fitting the model and computing statistics, but at each step it noted that the model validity was questionable. This behavior is new to the latest release, version 6, of the SAS system; in previous versions, model fitting stopped as soon as the separation was found and a warning was written to the output. The same result was returned when modeling whether or not the patient had a life-threatening illness.

When the location of injury (small bowel or colon) was examined as a response variable, the stepwise logistic regression failed to find an adequate model, judging by the low sensitivity and specificity of the model. The highest sensitivity and specificity found were 5% and 43%, respectively. However, this response variable did not encounter the problem of quasicomplete separation of the sample data points.

The initial steps taken in an attempt to remedy the quasicomplete separation of the data points and the weakness of the third model included verifying the data and combining explanatory variables to reduce the number entered into the model to seven. The new set of explanatory variables included mechanism of injury, DPL gross and micro exam results, blood alcohol level, and four groups defined by any injuries or fractures found on initial examination. In addition, depending on which variable was being modeled, two of the following three variables were included when appropriate: location of injury, delay of more than six hours before surgery or not, and life-threatening illness or not.

Reducing the number of explanatory variables eliminated the quasicomplete separation for two of the response variables, delay and injury type. The third response variable, location of injury, now exhibited the problem of quasicomplete separation of the sample data points. Attempts to eliminate the quasicomplete separation for location of injury were unsuccessful. Backward stepwise regression was attempted, as was reclassifying the location of injury. Since the quasicomplete separation could not be resolved, the statistics from the stepwise logistic regression model for this variable were not interpretable.

The table below displays the quasicomplete separation in this example for the response variable, location of injury, by looking at the number of patients with pelvic injuries.

Table of Location of Injury for Patients with Pelvic Injuries

                    Location of Injury
                    Colon    Small Bowel
Pelvic Injury         4           0
Other Injury         27          30

As seen in the table, none of the patients who had small bowel injuries had pelvic injuries. This results in a quasicomplete separation of the data. The patients who had a pelvic injury were exclusively in the group of patients who had a colon injury, while the patients with another type of injury had either the small bowel or the colon as their location of injury. Therefore, quasicomplete separation of the sample points existed: the data are separated into two groups with the exception of a nonempty set of observations.

In the simple example illustrated in Figure 1, the majority of points were correctly allocated to their groups; only three points in that set were not correctly allocated. In this case the majority of patients were in the nonallocated set, while four were correctly allocated to the group whose location of injury was the colon.

The partial output for this example is given in Appendix A. In this output the warning of quasicomplete separation of the data can be seen in the fourth step of the procedure, as can a log likelihood that does not diminish to zero as it does in complete separation. The variables entered into the model are as follows: Threaten (whether or not the patient had a life-threatening illness), Alcohol (whether the patient's alcohol level was greater than zero, or no alcohol was found or the test was not done), MVAU (whether the patient was involved in an unrestrained motor vehicle accident), and Grp_ (whether the patient had a pelvic injury or another type of injury).

The odds ratios and other statistics are calculated for this model with a warning about questionable model validity, which refers to the existence of quasicomplete separation of the data points. These additional statistics are not based on maximum likelihood estimates, because those estimates do not exist; therefore, these statistics should not be used until the model validity has been determined. Running the model with another set of data is one way of verifying model validity.
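The behavior described above, a log likelihood that keeps improving without ever reaching zero, can be illustrated on the pelvic injury table alone. The Python sketch below is an illustration under stated assumptions rather than a reproduction of the stepwise model: it takes the cell counts to be 4 and 0 for pelvic injury and 27 and 30 for other injury (consistent with the sixty-one patients described), fits one probability of small bowel injury per row, and pushes the pelvic row's probability toward zero. The function names empty_cells and neg2_log_l are hypothetical.

```python
import math

# Assumed cell counts (rows: injury group, columns: location of injury).
table = {
    "pelvic": {"colon": 4, "small_bowel": 0},
    "other": {"colon": 27, "small_bowel": 30},
}


def empty_cells(tab):
    """Return the (row, column) pairs whose count is zero."""
    return [(row, col) for row, cols in tab.items()
            for col, n in cols.items() if n == 0]


def neg2_log_l(tab, p_small_bowel):
    """-2 Log L when each row has its own probability of small bowel injury."""
    log_l = 0.0
    for row, cols in tab.items():
        p = p_small_bowel[row]
        # Skip empty cells so that 0 * log(0) is treated as 0.
        if cols["small_bowel"]:
            log_l += cols["small_bowel"] * math.log(p)
        if cols["colon"]:
            log_l += cols["colon"] * math.log(1.0 - p)
    return -2.0 * log_l


n_small = sum(cols["small_bowel"] for cols in table.values())
n_total = sum(sum(cols.values()) for cols in table.values())

# Intercept-only fit: one common probability for both rows.
p_all = n_small / n_total
intercept_only = neg2_log_l(table, {"pelvic": p_all, "other": p_all})

# With the pelvic indicator in the model, the "other" row is fitted by its
# observed proportion, while the pelvic row's probability drifts toward 0.
p_other = table["other"]["small_bowel"] / sum(table["other"].values())
path = [neg2_log_l(table, {"pelvic": eps, "other": p_other})
        for eps in (0.1, 0.01, 0.001)]
limit = neg2_log_l(table, {"pelvic": 0.0, "other": p_other})

print(empty_cells(table))
print(round(intercept_only, 3))
print([round(v, 3) for v in path], round(limit, 3))
```

Under these assumed counts, -2 Log L keeps decreasing as the pelvic row's probability shrinks, but it levels off at a positive value contributed by the "other" row, whose patients are split between the two locations. The improvement never stops at a finite parameter value, which is why no maximum likelihood estimate exists, and the plateau above zero is what distinguishes quasicomplete from complete separation.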

CONCLUSION

Quasicomplete separation occurs when the data are not completely separated and a vector of pseudoestimates correctly allocates all but a nonempty set of observations to their response groups. This was illustrated with a medical example. Some remedies can be attempted to relieve the separation of the sample data, including increasing the sample size, categorizing quantitative variables, and reducing the number of explanatory variables. The last of these proved useful in two of the three logistic regression models attempted in our example. For the third model, which used location of injury as the response variable, the quasicomplete separation could not be eliminated and a successful model could not be achieved.

ACKNOWLEDGEMENTS

SAS and SAS/STAT are registered trademarks or trademarks of the SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

REFERENCES

Albert A, Anderson JA (1984). On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika, 71: 1-10.

Harrell FE, Lee KL, Matchar DB, Reichert TA (1985). Regression Models for Prognostic Prediction: Advantages, Problems, and Suggested Solutions. Cancer Treatment Reports, 69: 1071-1077.

SAS Institute Inc. (1995). Logistic Regression Examples Using the SAS System, Version 6, First Edition. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1996). SAS/STAT Software: Changes and Enhancements through Release 6.11. Cary, NC: SAS Institute Inc.

So, Y (1993). A Tutorial on Logistic Regression. Proceedings of the Eighteenth Annual SAS Users Group International Conference, 1290-1295.

Address correspondence to:
Madeline Boyle
Department of Biostatistics
Research Office Building, Room 3
Carolinas Medical Center
PO Box 386
Charlotte, NC 83-86
Work: 74-355-459
Fax: 74-355-88
Email: mjboyle@med.unc.edu

APPENDIX A: PARTIAL OUTPUT FROM PROC LOGISTIC FOR THE QUASICOMPLETE EXAMPLE

The LOGISTIC Procedure

Data Set: INJURIES
Response Variable: LOCATE (1=small bowel & 0=colon)
Response Levels: 2
Number of Observations: 61
Link Function: Logit

Response Profile

Ordered Value   LOCATE   Count
      1            0       31
      2            1       30

Step 4. Variable Grp_ entered:

Maximum Likelihood Iteration Phase

Iter Step INITIAL IRLS -Log L Intercept 8454756 37 6379 55 Threaten -689 Grp_ 394 Alcohol -385 MVAU 694
IRLS 69898 549-6 476-95 44
IRLS 69895 549-6 476-95 44

WARNING: There is possibly a quasicomplete separation in the sample points. The maximum likelihood estimate may not exist.

WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is in question.

Summary of Stepwise Procedure

Step 3 4 5
Variable Entered: Threaten Alcohol MVAU Grp_
Variable Removed:
Number In: 3 4 3
Score Chi-Square, Wald Chi-Square, Pr > Chi-Square: 6344 6 39853 459 457 398 867 789 9 974