
SOCY5061

RELATIVE RISKS, RELATIVE ODDS, LOGISTIC REGRESSION

RELATIVE RISKS: Suppose we are interested in the association between lung cancer and smoking. Consider the following table for the whole population:

                           Smoker
    Lung cancer     Yes      No       Total
    Yes             m11      m12      m1.
    No              m21      m22      m2.
    Total           m.1      m.2      m..

m11 = number of people who both have lung cancer and smoke
m12 = number of people who have lung cancer and don't smoke
m21 = number of people who do not have lung cancer and smoke
m22 = number of people who neither have lung cancer nor smoke

The dot in a subscript shows that the values were summed over all values in that position, e.g. m1. = m11 + m12.

We could ask whether the risk of lung cancer is higher among smokers than non-smokers: is

    m11/m.1 > m12/m.2 ?

This question can be asked in another way: is the relative risk of cancer for smokers compared to non-smokers greater than 1? The relative risk is the ratio of the risk for smokers to the risk for non-smokers, i.e.

    relative risk = risk for smokers / risk for nonsmokers = (m11/m.1) / (m12/m.2)

A relative risk greater than 1 means that smokers have a higher risk of lung cancer than non-smokers.

A made-up example:

                           Smoker
    Lung cancer     Yes           No            Total
    Yes             120           12            132
    No              1,199,880     1,199,988     2,399,868
    Total           1,200,000     1,200,000     2,400,000
    Risk            .0001         .00001
                    (1/10,000)    (1/100,000)

Relative risk: .0001/.00001 = 10

The risk is 10 times higher for smokers than for non-smokers - but because lung cancer is rare, we had to look at a lot of people - here over 2 million - to find the relative risk.
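The risk and relative-risk arithmetic above can be checked in a few lines of Python (a minimal sketch; the variable names follow the m-subscript notation in the table):

```python
# Counts from the made-up lung cancer example (m11 = cancer & smoker, etc.)
m11, m12 = 120, 12
m21, m22 = 1_199_880, 1_199_988

m_dot1 = m11 + m21   # m.1: all smokers
m_dot2 = m12 + m22   # m.2: all non-smokers

risk_smokers = m11 / m_dot1       # .0001
risk_nonsmokers = m12 / m_dot2    # .00001
relative_risk = risk_smokers / risk_nonsmokers

print(round(relative_risk, 6))  # 10.0
```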

RELATIVE ODDS: Instead of calculating the risk of lung cancer, we can calculate the odds of lung cancer for smokers and for non-smokers. The odds of lung cancer is the number with lung cancer divided by the number without lung cancer.

    For smokers:    m11/m21 = 120/1,199,880
    For nonsmokers: m12/m22 = 12/1,199,988

For those of you who know something about betting on horse races, you know that the chances that a horse will win are given in terms of odds - odds of 3:2 mean the bookies guess that the horse has a 60% chance of winning.

The relative odds is the ratio of the odds. The relative odds for smokers compared to nonsmokers is given by

    odds for smokers / odds for nonsmokers = (m11/m21) / (m12/m22) = (m11 m22) / (m12 m21)

In our case, the relative odds and the relative risk are nearly identical - which happens when risks are small. Relative odds are important here for two reasons: they can also be calculated in what are called case-control studies, and they are used in logistic regression.

CASE-CONTROL STUDIES

In a case-control study, we pick cases of the rare condition - here lung cancer - and match to them individuals with similar characteristics but who do not have lung cancer. In this case we are sampling on the dependent variable. What we will demonstrate is that, from this type of study, the odds ratio can be estimated.

                           Smoker
    Lung cancer     Yes     No     Total
    Yes             a       b      1000
    No              c       d      1000

In this case, we have selected 1000 people with lung cancer and 1000 without. We are trying to estimate the relative odds of lung cancer for smokers as compared to nonsmokers. The relative odds in this table are (a/c) / (b/d) = ad/bc. The question is whether this is the same as the relative odds when we have a true population sample, (m11 m22) / (m12 m21). To see that this is indeed the case, we note that

    a = 1000 x (m11/m1.)    where (m11/m1.) is the proportion of smokers among those with lung cancer
    b = 1000 x (m12/m1.)    where (m12/m1.) is the proportion of non-smokers among those with lung cancer
    c = 1000 x (m21/m2.)    where (m21/m2.) is the proportion of smokers among those without lung cancer
    d = 1000 x (m22/m2.)    where (m22/m2.) is the proportion of non-smokers among those without lung cancer

Then

    ad/bc = [(m11/m1.) x (m22/m2.)] / [(m12/m1.) x (m21/m2.)] = (m11 m22) / (m12 m21)
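The algebra can also be verified numerically. A minimal Python sketch using the made-up population counts (the sampling step follows the a, b, c, d construction above; variable names are mine):

```python
# Population counts from the made-up example (m11 = cancer & smoker, etc.)
m11, m12, m21, m22 = 120, 12, 1_199_880, 1_199_988

# Odds ratio computed from the full population table
or_population = (m11 * m22) / (m12 * m21)

# Case-control sample: 1000 cases and 1000 controls, sampled on the outcome
m1_dot, m2_dot = m11 + m12, m21 + m22
a = 1000 * m11 / m1_dot   # smokers among the 1000 cases
b = 1000 * m12 / m1_dot   # non-smokers among the 1000 cases
c = 1000 * m21 / m2_dot   # smokers among the 1000 controls
d = 1000 * m22 / m2_dot   # non-smokers among the 1000 controls
or_case_control = (a * d) / (b * c)

# The case-control design recovers the population odds ratio
print(abs(or_population - or_case_control) < 1e-9)  # True
```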

Therefore, case-control studies allow us to estimate relative odds (which is also referred to as the odds ratio).

LOGISTIC REGRESSION

Logistic regression is a direct extension of the analysis of odds and odds ratios. Logistic regression is used when the dependent variable is a dichotomy - e.g. cancer, yes or no. What logistic regression does is consider the odds of being in category 1 of the dependent variable. If p = the probability of being in category 1 (e.g. of having lung cancer), and 1-p = the probability of being in category 0 (of not having lung cancer), then the odds of having cancer are p/(1-p). The log odds is called the logit of p, i.e.,

    logit(p) = L = log[p/(1-p)]

Therefore, logit regression is concerned with predicting the probability, or the odds, of being in category 1. One of the benefits of using logs is that we never estimate a probability greater than 1 or less than zero, which could happen if we used OLS. (The logit of 0 is -∞ and the logit of 1 is +∞.) Logit regression also overcomes the problem that, with only two possible values of the dependent variable, 0 or 1, the error term in the standard regression model cannot be normally distributed. We should note that if we know L, we can convert back to p:

    p = 1 / (1 + e^-L)

The model in logit regression (or logistic regression) is:

    logit(p) = log[p/(1-p)] = β0 + β1 X1 + β2 X2 + β3 X3 + ... + β(K-1) X(K-1)

Before discussing how we estimate the β's, let's look at an example. It comes from an analysis by Furstenberg and Morgan of the National Survey of Children and looks at 15 and 16 year-olds. The dependent variable of interest is Y = Ever had intercourse; Y = 1 if they have and 0 if not. In one of their papers, they considered two dummy variable predictors: Black and Female.
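The logit and its inverse are easy to compute directly. A minimal Python sketch (the function names are my own):

```python
import math

def logit(p):
    """Log odds: L = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(L):
    """Convert a logit back to a probability: p = 1 / (1 + e^-L)."""
    return 1.0 / (1.0 + math.exp(-L))

# The two functions are inverses of one another
p = 0.25
print(abs(inv_logit(logit(p)) - p) < 1e-12)  # True

# p near 0 gives a large negative logit, p near 1 a large positive one,
# so a predicted probability can never escape the (0, 1) interval
print(round(logit(0.0001), 2), round(logit(0.9999), 2))  # -9.21 9.21
```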

. logit inter female black

Iteration 0:  Log Likelihood = -264.6267
Iteration 1:  Log Likelihood = -246.37474
Iteration 2:  Log Likelihood = -245.89753
Iteration 3:  Log Likelihood = -245.89738

Logit Estimates                              Number of obs =    462
                                             chi2(2)       =  37.46
                                             Prob > chi2   = 0.0000
Log Likelihood = -245.89738                  Pseudo R2     = 0.0708

------------------------------------------------------------------------------
   inter |      Coef.   Std. Err.        t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |  -.6478314    .2250376   -2.879     0.004     -1.090063   -.2055996
   black |    1.31346    .2377786    5.524     0.000      .8461905     1.78073
   _cons |   -1.12112    .1624118   -6.903     0.000     -1.440282   -.8019567
------------------------------------------------------------------------------

The interpretation of the constant, -1.12, is that the odds of having intercourse for a white male (female=0, black=0) are e^-1.12. The coefficient of female, -.65: the relative odds of having ever had intercourse for girls compared to boys of the same race are e^-.65. The coefficient of black, 1.31: the relative odds of having ever had intercourse for blacks compared to whites of the same gender are e^1.31.

We can either calculate the relative odds ourselves (using the display command in STATA) or ask STATA to compute them for us, by using the logistic command.

. logistic inter female black

Logit Estimates                              Number of obs =    462
                                             chi2(2)       =  37.46
                                             Prob > chi2   = 0.0000
Log Likelihood = -245.89738                  Pseudo R2     = 0.0708

------------------------------------------------------------------------------
   inter | Odds Ratio   Std. Err.        t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |   .5231791     .117735   -2.879     0.004      .3361953     .814159
   black |    3.71902    .8843034    5.524     0.000      2.330751    5.934185
------------------------------------------------------------------------------

We also note that when

    β < 0, the odds decrease as X increases (e.g. as we go from female = 0 to female = 1);
    β = 0, X is unrelated to Y;
    β > 0, the odds increase as X increases (e.g. as we go from black = 0 to black = 1).

In general, e^β is the relative odds that Y=1 when an individual with X = x is compared to an individual with exactly the same characteristics except X = x-1.
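Exponentiating the logit coefficients should reproduce the odds ratios that logistic reports. A quick Python check using the coefficients from the output above:

```python
import math

# Coefficients from the logit output above
b_female, b_black, b_cons = -0.6478314, 1.31346, -1.12112

# e^β turns each coefficient into a relative odds (odds ratio)
print(round(math.exp(b_female), 4))  # 0.5232, matching the logistic output
print(round(math.exp(b_black), 4))   # 3.719, matching the logistic output
print(round(math.exp(b_cons), 4))    # baseline odds for a white male, about 0.33
```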

ESTIMATING THE PROBABILITY THAT Y=1

STATA will estimate p from these results. For example, for a black female, we estimate

    logit(p) = -1.12 - .65 (female=1) + 1.31 (black=1) = L

Then the estimated p = 1/(1 + e^-L). After we use the command logit, we can then issue the command:

. predict phat
. tab phat

       phat |      Freq.     Percent        Cum.
------------+-----------------------------------
   .1456728 |        175       37.88       37.88     white females
   .2458037 |        177       38.31       76.19     white males
   .3880561 |         58       12.55       88.74     black females
   .5479375 |         52       11.26      100.00     black males
------------+-----------------------------------
      Total |        462      100.00

LINEAR VS MULTIPLICATIVE MODEL

The logit model is linear:

    logit(p) = log[p/(1-p)] = β0 + β1 X1 + β2 X2 + β3 X3 + ... + β(K-1) X(K-1)

But the model for the odds is multiplicative:

    p/(1-p) = e^(β0 + β1 X1 + β2 X2 + ... + β(K-1) X(K-1)) = e^β0 e^(β1 X1) e^(β2 X2) ... e^(β(K-1) X(K-1))

HYPOTHESIS TESTS

The STATA output gives us approximate t-tests for the coefficients and approximate confidence intervals. There is also a test analogous to the F-test that allows us to compare nested models - and this test is actually the more precise one. It is a chi-square test. The degrees of freedom for the test are equal to the number of X variables included in the model. The statistic compares the model with only the intercept to the one with K-1 predictors. In the case of our example above, we have two predictors, and the chi2(2) tells us that, with two variables, we fit our data significantly better than if we only use an intercept. The actual test statistic uses the Log Likelihood, which is calculated because the method of estimation is that of maximum likelihood. In this case, the quantity -2 Log Likelihood has a chi-square distribution, with degrees of freedom = n - K (n = sample size, K = number of estimated parameters).

The logit and logistic output in STATA gives the information for testing whether our model fits better than the one with intercept only. The difference between two -2 Log Likelihood statistics also has a chi-square distribution, with df = K - H, where H is the number of parameters estimated in the smaller model. Therefore, the chi2(K-H) statistic given in the STATA output is actually

    -2 Log Likelihood_H - (-2 Log Likelihood_K) = -2 (Log Likelihood_H - Log Likelihood_K)

Logit regression results differ in one important way from OLS. In OLS, the F-test was completely equivalent to the t-test when we were comparing a model with a nested model with one fewer predictor. Here, the chi-square test may give a different result from the individual t-tests - and the chi-square test is the more reliable one. Next time we will discuss how we get the estimates of the coefficients - what this likelihood is - and some diagnostics.
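The chi2(2) = 37.46 in the output can be reproduced from the iteration log, since Iteration 0 reports the log likelihood of the intercept-only model. A minimal Python check:

```python
# Log likelihoods from the Stata iteration log for the intercourse example
ll_H = -264.6267    # Iteration 0: intercept-only model (H = 1 parameter)
ll_K = -245.89738   # final iteration: intercept + female + black (K = 3)

# Likelihood-ratio statistic, compared to a chi-square with K - H = 2 df
lr_chi2 = -2 * (ll_H - ll_K)
print(round(lr_chi2, 2))  # 37.46, matching chi2(2) in the output
```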