Least likely observations in regression models for categorical outcomes

Similar documents
Editor Executive Editor Associate Editors Copyright Statement:

Age (continuous) Gender (0=Male, 1=Female) SES (1=Low, 2=Medium, 3=High) Prior Victimization (0= Not Victimized, 1=Victimized)

Estimating Heterogeneous Choice Models with Stata

Modeling unobserved heterogeneity in Stata

Notes for laboratory session 2

m 11 m.1 > m 12 m.2 risk for smokers risk for nonsmokers

Sociology 63993, Exam1 February 12, 2015 Richard Williams, University of Notre Dame,

Sociology Exam 3 Answer Key [Draft] May 9, 201 3

Final Exam - section 2. Thursday, December hours, 30 minutes

Multiple Linear Regression Analysis

Regression Output: Table 5 (Random Effects OLS) Random-effects GLS regression Number of obs = 1806 Group variable (i): subject Number of groups = 70

ECON Introductory Econometrics Seminar 7

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

ADVANCED STATISTICAL METHODS: PART 1: INTRODUCTION TO PROPENSITY SCORES IN STATA. Learning objectives:

Name: emergency please discuss this with the exam proctor. 6. Vanderbilt s academic honor code applies.

MULTIPLE REGRESSION OF CPS DATA

Generalized Mixed Linear Models Practical 2

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

Daniel Boduszek University of Huddersfield

Syntax Menu Description Options Remarks and examples Stored results Methods and formulas References Also see

Chapter 34 Detecting Artifacts in Panel Studies by Latent Class Analysis

This tutorial presentation is prepared by. Mohammad Ehsanul Karim

Daniel Boduszek University of Huddersfield

Limited dependent variable regression models

ANOVA. Thomas Elliott. January 29, 2013

1. Objective: analyzing CD4 counts data using GEE marginal model and random effects model. Demonstrate the analysis using SAS and STATA.

Midterm Exam ANSWERS Categorical Data Analysis, CHL5407H

Immortal Time Bias and Issues in Critical Care. Dave Glidden May 24, 2007

Multivariate dose-response meta-analysis: an update on glst

11/24/2017. Do not imply a cause-and-effect relationship

MODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING

ethnicity recording in primary care

Generalized Linear Models of Malaria Incidence in Jubek State, South Sudan

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

In this module I provide a few illustrations of options within lavaan for handling various situations.

The Stata Journal. Editor Nicholas J. Cox Geography Department Durham University South Road Durham City DH1 3LE UK

Student name: SOCI 420 Advanced Methods of Social Research Fall 2017

An Introduction to Modern Econometrics Using Stata

Estimating average treatment effects from observational data using teffects

Final Exam. Thursday the 23rd of March, Question Total Points Score Number Possible Received Total 100

Sarwar Islam Mozumder 1, Mark J Rutherford 1 & Paul C Lambert 1, Stata London Users Group Meeting

Heuristics as a Proxy for Contestant Risk Aversion in Deal or No Deal

Statistical reports Regression, 2010

Introduction to regression

Business Statistics Probability

Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd Ed.

Binary Diagnostic Tests Two Independent Samples

Introduction to Logistic Regression

Use of GEEs in STATA

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

Still important ideas

MIDAS RETOUCH REGARDING DIAGNOSTIC ACCURACY META-ANALYSIS

For any unreported outcomes, umeta sets the outcome and its variance at 0 and 1E12, respectively.

Modeling Binary outcome

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

Evaluating the Matlab Interventions Using Non-standardOctober Methods10, and2012 Outcomes 1 / 31

STATISTICAL MODELING OF THE INCIDENCE OF BREAST CANCER IN NWFP, PAKISTAN

Implementing double-robust estimators of causal effects

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Regression so far... Lecture 22 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

Lab 4 (M13) Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

Cross-validated AUC in Stata: CVAUROC

4. STATA output of the analysis

Dan Byrd UC Office of the President

Online Appendix. According to a recent survey, most economists expect the economic downturn in the United

Analysis and Interpretation of Data Part 1

Stat 13, Lab 11-12, Correlation and Regression Analysis

Conducting interrupted time-series analysis for single- and multiple-group comparisons

Propensity Score Methods for Causal Inference with the PSMATCH Procedure

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Today Retrospective analysis of binomial response across two levels of a single factor.

Part 8 Logistic Regression

Using a Likert-type Scale DR. MIKE MARRAPODI

Score Tests of Normality in Bivariate Probit Models

Lessons in biostatistics

Fitting discrete-data regression models in social science

Data Analysis with SPSS

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Person centered treatment (PeT) effects: Individualized treatment effects using instrumental variables

Logistic regression: Why we often can do what we think we can do 1.

Benchmark Dose Modeling Cancer Models. Allen Davis, MSPH Jeff Gift, Ph.D. Jay Zhao, Ph.D. National Center for Environmental Assessment, U.S.

Choosing the Correct Statistical Test

MEASURES OF ASSOCIATION AND REGRESSION

Online Appendix. Sampling and Administrative Details

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Still important ideas

Generalized Linear Models and Logistic Regression

1. You want to find out what factors predict achievement in English. Develop a model that

isc ove ring i Statistics sing SPSS

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Choosing an Approach for a Quantitative Dissertation: Strategies for Various Variable Types

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Calculating bronchiolitis severity using Ordinal Regression with a new command in Stata

Basic Biostatistics. Chapter 1. Content

Business Research Methods. Introduction to Data Analysis

Analyzing binary outcomes, going beyond logistic regression

Transcription:

The Stata Journal (2002) 2, Number 3, pp. 296 300 Least likely observations in regression models for categorical outcomes Jeremy Freese University of Wisconsin Madison Abstract. This article presents a method and program for identifying poorly fitting observations for maximum-likelihood regression models for categorical dependent variables. After estimating a model, the program leastlikely will list the observations that have the lowest predicted probabilities of observing the value of the outcome category that was actually observed. For example, when run after estimating a binary logistic regression model, leastlikely will list the observations with a positive outcome that had the lowest predicted probabilities of a positive outcome and the observations with a negative outcome that had the lowest predicted probabilities of a negative outcome. These can be considered the observations in which the outcome is most surprising given the values of the independent variables and the parameter estimates and, like observations with large residuals in ordinary least squares regression, may warrant individual inspection. Use of the program is illustrated with examples using binary and ordered logistic regression. Keywords: st0022, outliers, predicted probabilities, categorical dependent variables, logistic regression 1 Overview After estimating a linear regression model with a continuous outcome, data analysts will commonly calculate the residuals for each observation and examine those with the largest residuals more carefully to check if there is some discernible reason why the parameter estimates fit these observations so poorly. The commonsense understanding of observations with large residuals is that they are the observations for which the values of the dependent variable are most surprising ( unexpected, weird ), given the regression coefficients and the values of the independent variables. In regression models for categorical outcomes, such as those discussed in Long (1997), residuals are not as readily conceptualized, although attempted extensions from models for continuous outcomes have been made (Pregibon 1981). Stata estimates many of these models using maximum likelihood, and, in many cases, the parameter estimates can be thought of as the set of estimates that maximizes the joint probability of observing the values of the outcome categories that were actually observed. Accordingly, for the observations within each outcome category, the most surprising are those that have the lowest predicted probabilities of observing that outcome, and these may warrant closer inspection by analysts for precisely the reason that observations with large residuals do in the more familiar linear regression model. c 2002 Stata Corporation st0022

J. Freese 297 When run after various models for categorical outcomes, the command leastlikely will list the least likely observations. In other words, after a binary model, leastlikely will list both the observations with the lowest Pr (y =0 y =0)and Pr (y 0 y 0). leastlikely may be used after many models for binary outcomes in which the option p after predict generates the predicted probabilities of a positive outcome (e.g., logit, probit, cloglog, scobit, hetprob) and after many models for ordered or nominal outcomes in which the option outcome(#) after predict generates the predicted probability of outcome # (e.g., ologit, oprobit, mlogit). leastlikely is not appropriate for models in which the probabilities produced by predict are probabilities within groups or panels or for blocked data, and it will produce an error if executed after blogit, bprobit, clogit, glogit, gprobit, nlogit, or xtlogit. 2 Syntax leastlikely [ varlist ] [ if exp ] [ in range ] [, n(#) generate(varname) [ no ] display nolabel noobs doublespace ] where varlist contains any variables whose values are to be listed in addition to the observation numbers and probabilities. 3 Options n(#) specifies the number of observations to be listed for each outcome. The default is 5. In the case of multiple observations with identical predicted probabilities, all will be listed. generate(varname) specifies that the probabilities of observing the outcome value that was observed should be stored in varname. Ifnot specified, the variable name Prob will be created but dropped after the output is produced. The remaining options are standard options available after list. [no]display forces the format into display or tabular (nodisplay) format. If you do not specify one of these two options, then Stata chooses one based on its judgment of which would be most readable. nolabel causes the numeric codes rather than the label values to be displayed. noobs suppresses printing of the observation numbers. doublespace produces a blank line between each observation in the listing when in nodisplay mode; it has no effect in display mode.

298 Least likely observations 4 Examples Using data from Mroz (1987; see Long and Freese 2001, Chapter 4), I estimate a logistic regression model of the effects of several independent variables on a woman s probability of being in the labor force (lfp), and then I use leastlikely to list the predicted probabilities and observation numbers of the least likely observations.. use binlfp2, clear (PSID 1976 / T Mroz). logit lfp k5 k618 age wc hc lwg inc Iteration 0: log likelihood = -514.8732 Iteration 1: log likelihood = -454.32339 Iteration 2: log likelihood = -452.64187 Iteration 3: log likelihood = -452.63296 Iteration 4: log likelihood = -452.63296 Logit estimates Number of obs = 753 LR chi2(7) = 124.48 Prob > chi2 = 0.0000 Log likelihood = -452.63296 Pseudo R2 = 0.1209 lfp Coef. Std. Err. z P> z [95% Conf. Interval] k5-1.462913.1970006-7.43 0.000-1.849027-1.076799 k618 -.0645707.0680008-0.95 0.342 -.1978499.0687085 age -.0628706.0127831-4.92 0.000 -.0879249 -.0378162 wc.8072738.2299799 3.51 0.000.3565215 1.258026 hc.1117336.2060397 0.54 0.588 -.2920969.515564 lwg.6046931.1508176 4.01 0.000.3090961.9002901 inc -.0344464.0082084-4.20 0.000 -.0505346 -.0183583 _cons 3.18214.6443751 4.94 0.000 1.919188 4.445092. leastlikely Outcome: 0 (NotInLF) Prob 60..1231792 172..1490344 221..1470691 235..1666356 252..1088271 Outcome: 1 (inlf) Prob 338..1760865 534..0910262 568..178205 635..0916614 662..1092709 Of respondents not in the labor force (lfp=0), observation #252 had the lowest predicted probability of not being in the labor force. For women in the labor force (lfp=0), observation #534 has the lowest predicted probability of being there. Using data from Long and Freese (2001, Chapter 5; see also Long 1997), I estimate an ordered logistic regression model of the probability of agreement with the proposition that working mothers can establish as warm a relationship with their children as mothers who do not work (warm).

J. Freese 299. use ordwarm2, clear (77 & 89 General Social Survey). ologit warm yr89 male white age ed prst Iteration 0: log likelihood = -2995.7704 Iteration 1: log likelihood = -2846.4532 Iteration 2: log likelihood = -2844.9142 Iteration 3: log likelihood = -2844.9123 Ordered logit estimates Number of obs = 2293 LR chi2(6) = 301.72 Prob > chi2 = 0.0000 Log likelihood = -2844.9123 Pseudo R2 = 0.0504 warm Coef. Std. Err. z P> z [95% Conf. Interval] yr89.5239025.0798988 6.56 0.000.3673037.6805013 male -.7332997.0784827-9.34 0.000 -.8871229 -.5794766 white -.3911595.1183808-3.30 0.001 -.6231815 -.1591374 age -.0216655.0024683-8.78 0.000 -.0265032 -.0168278 ed.0671728.015975 4.20 0.000.0358624.0984831 prst.0060727.0032929 1.84 0.065 -.0003813.0125267 _cut1-2.465362.2389126 (Ancillary parameters) _cut2 -.630904.2333155 _cut3 1.261854.2340179 Iuse the option n(3) after leastlikely to restrict output to only the three most unlikely observations within each of the four outcome categories. Specifying age and male tells Stata to list values of these variables along with the observation numbers and probabilities.. leastlikely age male, n(3) Outcome: 1 (SD) 167..0401364 37 Women 222..0449925 29 Women 271..0407333 20 Women Outcome: 2 (D) 563..1072643 41 Women 803..1028648 30 Women 1001..1307181 32 Women Outcome: 3 (A) 1344..1559092 72 Men 1449..1358758 71 Men 1729..1283106 81 Men Outcome: 4 (SA) 1963..0387174 64 Men 2107..0413501 69 Men 2138..0393529 57 Men

300 Least likely observations As we would expect given the direction of the ologit coefficients, the respondents who strongly disagreed with the proposition but were least likely to do so were all younger women, while the least likely respondents who nonetheless strongly agreed with the proposition were all older males. 5 References Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications. Long, J. S. and J. Freese. 2001. Regression Models for Categorical Dependent Variables using Stata. College Station, TX: Stata Press. Mroz, T. A. 1987. The sensitivity of an empirical model of married women s hours of work to economic and statistical assumptions. Econometrica 55: 765 799. Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705 724. About the Author Jeremy Freese is Assistant Professor of Sociology at the University of Wisconsin Madison.