STAT/SOC/CSSS 221 Statistical Concepts and Methods for the Social Sciences. Introduction to Bivariate Regression


STAT/SOC/CSSS 221 Statistical Concepts and Methods for the Social Sciences Introduction to Bivariate Regression Christopher Adolph Department of Political Science and Center for Statistics and the Social Sciences University of Washington, Seattle Chris Adolph (UW) Bivariate Regression 1 / 55

Motivating Example

We have cross-national data from several sources:

Fertility: the average number of children born per adult female, in 2000 (United Nations)
Education Ratio: the ratio of girls to boys in primary and secondary education, in 2000 (World Bank Development Indicators)
GDP per capita: economic activity in thousands of dollars, purchasing power parity, in 2000 (Penn World Tables)

What are the levels of measurement of these variables?

Our question: how are these variables related to each other?

Motivating Example: Fertility, Female Education, and Development

Specifically, we ask:

If the level of female education changed by a certain amount, how much would we expect Fertility to change?
If the level of GDP per capita changed by a certain amount, how much would we expect Fertility to change?
How much would we expect our predictions to be off because of other random factors (noise)?
How much would we expect our predictions to be off because of sampling variability (poor estimation)?

Answering these questions will go far towards evaluating hypotheses about relationships between variables.

Outline

Review the univariate summary statistics for our example
Explore the bivariate relationship between Fertility and Education Ratio
Explore the bivariate relationship between Fertility and GDP per capita

Throughout, develop a deeper understanding of linear regression.

Summary of Univariate Distribution: Fertility

[Figure: histogram of Fertility Rate (x-axis 0 to 8, y-axis Frequency). Median = 2.60 children, mean = 3.12 children, std. dev. = 1.67 children.]

How would you describe this distribution?

Summary of Univariate Distribution: Education Ratio

[Figure: histogram of Female Education as % of Male (x-axis 60 to 120, y-axis Frequency). Median = 99.60%, mean = 94.48%, std. dev. = 12.45%.]

How would you describe this distribution?

Summary of Univariate Distribution: GDP per capita

[Figure: histogram of GDP per capita, PPP (x-axis 0 to 50,000, y-axis Frequency). Median = $6,047, mean = $10,200, std. dev. = $10,078.]

How would you describe this distribution?

[Figure: scatterplot of Fertility Rate (y-axis) against Female Students as % of Male (x-axis).]

How would you describe the relationship between Fertility and Education Ratio?

If I asked you to predict Fertility for a country not sampled, how accurate do you expect your prediction to be?

[Figure: scatterplot of Fertility Rate (y-axis) against Female Students as % of Male (x-axis), with each point labelled by country name.]

Labelling cases sometimes helps, especially for identifying outliers.

What makes a point an outlier?

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with the best fit line drawn through the points.]

The best fit line is the line that passes closest to the majority of the points.

If we take this line to be our model of Fertility, how do we interpret it?

Best fit lines

Customarily, in statistics, we write the equation of a line as:

$y = \beta_0 + \beta_1 x$

where:

$y$ is the dependent variable,
$x$ is the independent variable,
$\beta_1$ is the slope of the line, or the change in $y$ for a 1 unit change in $x$, and
$\beta_0$ is the intercept, or the value of $y$ when $x = 0$.

Best fit for fertility against education ratio

$\widehat{\text{Fertility}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{EduRatio}$

$\widehat{\text{Fertility}} = 12.59 - 0.10 \times \text{EduRatio}$

The above equation is the best fit line given by linear regression.

The $\hat{\beta}$s are the estimated linear regression coefficients.

$\widehat{\text{Fertility}}$ is the fitted value, or model prediction, of the level of Fertility given the EduRatio.

Interpreting regression coefficients

$\widehat{\text{Fertility}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{EduRatio}$

$\widehat{\text{Fertility}} = 12.59 - 0.10 \times \text{EduRatio}$

Interpreting $\hat{\beta}_1 = -0.10$: increasing EduRatio by 1 unit lowers Fertility by 0.10 units.

Because EduRatio is measured in percentage points, this means a 10 percentage point increase in female education (relative to males) will lower the number of children a woman has over her lifetime by 1 on average.

Interpreting regression intercepts

$\widehat{\text{Fertility}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{EduRatio}$

$\widehat{\text{Fertility}} = 12.59 - 0.10 \times \text{EduRatio}$

Interpreting $\hat{\beta}_0 = 12.59$: if EduRatio is 0, Fertility will be 12.59.

If there are no girls in primary or secondary education, then women are expected to have 12.59 children on average over their lifetimes.

Can we trust this prediction?

No. No country has 0 female education, so this is an extrapolation from the model.

Using regression coefficients to predict specific cases

$\widehat{\text{Fertility}} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{EduRatio}$

$\widehat{\text{Fertility}} = 12.59 - 0.10 \times \text{EduRatio}$

How many children do we expect women to have if girls get half the education boys do?

If EduRatio is 50, Fertility will be $12.59 - 0.10 \times 50 = 7.59$.

How many children do we expect women to have if girls get the same education boys do?

If EduRatio is 100, Fertility will be $12.59 - 0.10 \times 100 = 2.59$.
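The two predictions above are simple arithmetic on the fitted line. As a minimal sketch (the function and constant names are illustrative and not part of the lecture), the same calculation in Python, using the rounded coefficients reported on the slide:

```python
# Rounded coefficients as reported on the slide.
BETA0_HAT = 12.59   # intercept
BETA1_HAT = -0.10   # slope on EduRatio (girls as % of boys)

def predict_fertility(edu_ratio):
    """Fitted Fertility for a given EduRatio under the bivariate model."""
    return BETA0_HAT + BETA1_HAT * edu_ratio

for edu_ratio in (50, 100):
    print(edu_ratio, round(predict_fertility(edu_ratio), 2))
# 50 7.59
# 100 2.59
```

These are expected values along the line, not exact outcomes for any particular country.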

Using regression coefficients to predict specific cases

$\widehat{\text{Fertility}} = 12.59 - 0.10 \times \text{EduRatio}$

If EduRatio is 100, Fertility will be $12.59 - 0.10 \times 100 = 2.59$.

Does this hold exactly for any country with education parity?

No. It holds on average. In any specific case $i$, there is some error between the expected and actual levels of Fertility.

The linear regression model

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

To account for the random deviation of each case from the underlying trend, we add an error term, $\varepsilon_i$.

We will assume our $y_i$s follow the above model.

That is, we will assume there is some true $\beta_0$ and $\beta_1$ which generated the $y_i$ we observe, and some true error from this trend.

The linear regression model

$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$, with fitted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

When we estimate this model, we designate the estimates by adding hats.

The estimates $(\hat{\beta}_0, \hat{\beta}_1, \hat{\varepsilon}_i)$ probably differ from the (usually unknown) true values $(\beta_0, \beta_1, \varepsilon_i)$.

To emphasize this, we will call $\hat{\varepsilon}_i = y_i - \hat{y}_i$ the residual, since it is not the true error, but only an estimate.

Estimating linear regression coefficients

$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$

How do we obtain our estimates of the $\beta$s?

The full details are beyond the scope of 221.

A key assumption is that $\varepsilon_i$ is Normally distributed:

$\varepsilon_i \sim \text{Normal}(0, \sigma^2)$

[Figure. Source: Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics]

The distribution of $\varepsilon_i$ determines how closely or widely the $y_i$s are spaced around the best fit line.

Our key simplifying assumption is that everywhere around the line, the $y_i$s are spread with the same Normal distribution.

Estimating $\hat{\beta}$

With this assumption in mind, how do we find the best fit line?

Perhaps the line that minimizes the total residuals?

But some residuals are positive and others negative, so they cancel: their sum can be 0 even for a line that fits poorly.

So let's minimize the sum of squared errors!

Linear regression is fitted using the least squares procedure.
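The lecture leaves the estimation details aside, but for a single predictor the least squares estimates have a well-known closed form. The following is a minimal sketch on hypothetical toy data (not the course dataset), with illustrative function names. It shows that the residuals of the fitted line sum to zero and that nudging the slope away from the estimate increases the sum of squared errors.

```python
def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data, NOT the course dataset.
x = [60, 70, 80, 90, 100, 110]
y = [6.8, 5.1, 4.9, 3.2, 2.7, 1.9]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("intercept:", round(b0, 3), "slope:", round(b1, 4))
print("sum of residuals is ~0:", abs(sum(residuals)) < 1e-9)
print("SSE at the least squares fit:", round(sum_squared_errors(x, y, b0, b1), 4))
print("SSE with a nudged slope:     ", round(sum_squared_errors(x, y, b0, b1 + 0.01), 4))
```

In practice, statistics software performs this computation; the sketch only illustrates what "least squares" means.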

[Figure. Source: Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics]

The least squares estimates are the $\hat{\beta}$s that minimize the total area of the above squares.

[Figure. Source: Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics]

Statistics software can find these $\hat{\beta}$s easily.

Residuals

Notice the distinction between what we explain and what is left unexplained.

[Figure. Source: Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics]

Analysis of variance

The total variation in $y_i$ is its total squared deviation from the mean $\bar{y}$, or $\sum_{i=1}^{n} (y_i - \bar{y})^2$.

Using least squares, we can break down the variation in $y_i$ into two components:

Sum of squared errors (SSE): $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Regression sum of squares (RSS): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Total sum of squares (TSS): $\sum_{i=1}^{n} (y_i - \bar{y})^2$

so that TSS = RSS + SSE.

The regression sum of squares (RSS) is what we have explained.

The sum of squared errors (SSE) is what is left unexplained.
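The decomposition can be checked numerically. This is a minimal sketch on hypothetical toy data (not the course dataset); the helper `fit_line` is an illustrative name for the standard closed-form least squares fit with one predictor.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Hypothetical toy data, NOT the course dataset.
x = [60, 70, 80, 90, 100, 110]
y = [6.8, 5.1, 4.9, 3.2, 2.7, 1.9]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]   # fitted values
y_bar = sum(y) / len(y)

tss = sum((yi - y_bar) ** 2 for yi in y)                # total
rss = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained

print("TSS:", round(tss, 4))
print("RSS + SSE:", round(rss + sse, 4))
```

The identity holds for the least squares line with an intercept; for an arbitrary line the two components need not add up to TSS.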

Analysis of variance

The sum of squared errors is what is left unexplained:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$

A very useful summary of this is the square root of the mean squared error:

$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

This is how much a prediction from this linear regression will differ from the true $y_i$ on average.

Also known as the standard error of the regression.
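As a small worked example of the RMSE formula, here is a sketch with three hypothetical actual and fitted values (made up for illustration, not taken from the course data):

```python
import math

def rmse(actual, fitted):
    """Square root of the mean squared difference between actual and fitted values."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, fitted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and fitted fertility rates for three countries.
actual = [2.0, 4.0, 6.0]
fitted = [2.5, 3.5, 7.0]

# Squared errors are 0.25, 0.25 and 1.0; their mean is 0.5.
print(round(rmse(actual, fitted), 3))  # 0.707
```

The result is in the units of the dependent variable (here, children per woman), which is what makes it easy to interpret.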

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with the residuals drawn as vertical distances from each point to the best fit line.]

The residuals for the regression of Fertility on Education Ratio.

This line minimizes the squared deviations on the dependent variable.

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with best fit line and residuals.]

The smaller the sum of squared residuals, the better the model fits the data.

The quality of model fit is a separate issue from the substantive strength of the relationship, which is given by $\beta$, or the change in $y$ for a one unit change in $x$.

Goodness of fit

Our model is captured in the $\beta$s, or regression coefficients.

This is in contrast to the correlation coefficient $r$, which is a goodness of fit measure: values of $r$ that are larger in absolute value imply a better fit of the model to the data.

In our example, $r$ between Fertility and Education Ratio is $-0.75$.

Substantively, this number is hard to interpret. (What's a big $r$? A small $r$? The cutoffs are arbitrary.)

The coefficient of determination, $R^2$

One easy to interpret goodness of fit measure is $R^2$, known as the coefficient of determination.

In general, $R^2$ is the ratio of the variance the model explains to the total variance:

$R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}$

In bivariate regression only, $R^2$ is also the square of $r_{X,Y}$.

In our example, $R^2 = 0.56$, which says that Education Ratio explains 56% of the variation in Fertility, and vice versa.

$R^2$ is a proportional reduction in error (PRE) statistic.
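The claim that $R^2$ equals the squared correlation in bivariate regression can be verified numerically. This is a minimal sketch on hypothetical toy data (not the course dataset); all variable names are illustrative.

```python
import math

# Hypothetical toy data, NOT the course dataset.
x = [60, 70, 80, 90, 100, 110]
y = [6.8, 5.1, 4.9, 3.2, 2.7, 1.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # this is TSS
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit and its sum of squared errors.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
r_squared = 1 - sse / syy        # coefficient of determination

print("r:", round(r, 4))
print("r^2:", round(r ** 2, 4))
print("R^2:", round(r_squared, 4))
```

Note that $r$ keeps the sign of the slope while $R^2$ does not, so the two carry slightly different information.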

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with best fit line.]

I prefer a more tangible measure of goodness of fit, the root mean squared error (RMSE).

RMSE is how much your model predictions "miss by": here, 1.12 children per female.

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with best fit line.]

RMSE is better than $R^2$ because it can be compared across models and datasets; $R^2$ can't.

A question: we assumed the errors would be Normal. Are they?

[Figure: density plot of the residuals from the regression of Fertility on Education Ratio (x-axis roughly -2 to 4, y-axis Density).]

Recall that linear regression assumes the $\varepsilon_i$s are Normally distributed.

We do not assume that $y_i$ follows a bell curve, except after controlling for $x_i$.

Do the residuals appear Normally distributed in this case?

Uncertainty of $\hat{\beta}$

When estimating a mean or difference of means, we worried that by chance, our sample might not reflect the population.

That's a worry in linear regression as well.

Does $\hat{\beta}$ estimated from our sample reflect the true population $\beta$?

Or did we get an unusual result due to sampling variability?

Uncertainty of $\hat{\beta}$

As with estimating a mean, we can calculate the standard error of $\hat{\beta}$.

$\text{se}(\hat{\beta})$ is the amount we expect to miss the population $\beta$ by on average, across regressions run on repeated samples.

Remarkably, the $\hat{\beta}$s themselves are (at least approximately, in large samples) Normally distributed, no matter what $y_i$ we are modeling.

So we can use a t-test to see whether our $\hat{\beta}$s could differ from the null hypothesis value purely by chance.

Often, we will consider the null hypothesis to be $\beta_{\text{null}} = 0$, but sometimes we might want a different null.
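The idea of "repeated samples" can be made concrete with a small simulation. The following sketch assumes a hypothetical population line and error standard deviation (the constants `TRUE_B0`, `TRUE_B1` and `SIGMA` are invented for illustration, not estimated from the course data). It draws many datasets, re-estimates the slope each time, and compares the spread of those estimates with the textbook standard error formula for the slope, $\sigma / \sqrt{\sum (x_i - \bar{x})^2}$.

```python
import random
import statistics

def slope_estimate(x, y):
    """Least squares slope for a single predictor."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

random.seed(221)  # fixed seed so the run is reproducible

# Hypothetical "population" values, chosen only for illustration.
TRUE_B0, TRUE_B1, SIGMA = 12.0, -0.10, 1.0

# 130 x-values, held fixed across the simulated samples.
x = [random.uniform(50, 120) for _ in range(130)]

slopes = []
for _ in range(2000):
    # New Normal errors each time: a fresh sample from the same model.
    y = [TRUE_B0 + TRUE_B1 * xi + random.gauss(0, SIGMA) for xi in x]
    slopes.append(slope_estimate(x, y))

x_bar = statistics.fmean(x)
theoretical_se = SIGMA / sum((xi - x_bar) ** 2 for xi in x) ** 0.5

print("mean of slope estimates:", round(statistics.fmean(slopes), 4))
print("std. dev. of slope estimates:", round(statistics.stdev(slopes), 5))
print("theoretical se of the slope:", round(theoretical_se, 5))
```

The estimates cluster around the true slope, and their standard deviation is close to the theoretical standard error, which is what the standard error is meant to summarize.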

Uncertainty of $\hat{\beta}$

We can also construct confidence intervals around $\hat{\beta}_0$ and $\hat{\beta}_1$.

These CIs reflect the uncertainty created by randomly sampling our data from the population.

In 95% of samples, the true population $\beta$s should lie in their 95% confidence intervals.

Roughly, these intervals will be $\pm 2$ standard errors, if we have a lot of data.

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with best fit line and a shaded confidence band around it.]

The standard errors of $\hat{\beta}$ reflect the fact that in 95% of randomly sampled datasets, the true best fit line for the population lies within a band around the estimated line.

We can capture this "wiggle room" graphically.

[Figure: scatterplot of Fertility Rate against Female Students as % of Male, with best fit line and 95% confidence band.]

Why don't 95% of the datapoints lie inside this interval?

Because of fundamental uncertainty, or RMSE.

The CIs just measure uncertainty in the best fit line, not in the data itself.

A standard regression table

Regression of Fertility on Education Ratio

Variable          Estimate   (se)     t-stat    p-value
Intercept           12.59    (0.75)    16.75    < 0.001
Education Ratio     -0.10    (0.01)   -12.71    < 0.001

N       130
R^2     0.56
RMSE    1.12

The most common presentation of a linear regression is the above table.

Usually, graphics are more informative and easier to read, but older articles rely heavily on this tabular format.

Understanding these tables will be important for the final exam. Let's take this one apart.

A standard regression table

The middle of the table contains several important quantities regarding our independent variable(s):
Estimates: the β̂s, or regression coefficients
se: the standard errors of β̂
t-stat: the t-statistic for the regression coefficient, or β̂/se(β̂)
p-value: the probability of seeing a t-stat at least this large (in absolute value) by chance if the true β were 0

Regression of Fertility on Education Ratio

                              95% Confidence Interval
Variable           Estimate   [Lower, Upper]
Intercept           12.59     [11.11, 14.08]
Education Ratio     -0.10     [-0.12, -0.08]

N      130
R^2    0.56
RMSE   1.12

Just as with our other estimates, we can construct confidence intervals around our β̂s.
Our results show 95% confidence that a 1 unit (1 percentage point) increase in education of girls relative to boys lowers fertility by between 0.08 and 0.12 children per woman.
We would only expect the truth to lie outside this interval in 1 of 20 random samples.

Wait a minute!

When we considered the relationship of female education and fertility, we also hypothesized an effect of GDP per capita.
We suspected this might be an indirect effect, flowing through female education.
Can we use regression to check for an effect of GDP?

[Figure: scatterplot of Fertility Rate vs. GDP per capita (PPP, $k)]

Let's regress Fertility on GDP per capita.
Does this scatterplot suggest a linear relationship?

[Figure: scatterplot of Fertility Rate vs. GDP per capita (PPP, $k), with each point labeled by country name]

Not really. Later, we'll discuss solutions for curved relationships.
For now, let's proceed with the best linear fit.

[Figure: Fertility Rate vs. GDP per capita (PPP, $k), with least squares line]

This is the least squares fit. (What does that mean?)
How good does this fit look?

[Figure: Fertility Rate vs. GDP per capita (PPP, $k), with least squares line]

Can you imagine an alternative model that would reduce the sum of squared residuals further?
Perhaps a concave curve?

[Figure: density of residuals from Fertility vs GDP]

Do the residuals look Normally distributed?
A strongly skewed distribution of errors is cause for concern. More next week.

[Figure: Fertility Rate vs. GDP per capita (PPP, $k), with least squares line and 95% confidence interval]

How do we interpret this 95% confidence interval?
Why don't 95% of the points lie inside it?

Another regression table

Regression of Fertility on GDP per capita

Variable               Estimate   se       t-stat   p-value
Intercept                4.13     (0.17)    24.57   < 0.001
GDP per capita ($k)     -0.10     (0.01)    -8.44   < 0.001

N      130
R^2    0.36
RMSE   1.35

How do we interpret this table?

Another regression table: interpreting the estimates

1. How much do we expect Fertility to change when we increase GDP by $1000?
   Decrease by 0.1 children.
2. What would Fertility be if GDP were $1000? $10,000? $30,000?
   4.03, 3.13, and 1.13, respectively.
3. What would Fertility be if GDP were 0? Do you trust this estimate?
   4.13. No; this is an extrapolation.
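The arithmetic behind these answers can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the slope is entered as -0.10, the negative sign implied by the answers above (fertility falls as GDP rises).

```python
# Coefficients from the GDP regression table (GDP measured in $ thousands).
INTERCEPT = 4.13
SLOPE = -0.10  # negative, as implied by "decrease by 0.1 children"


def predicted_fertility(gdp_thousands):
    """Fitted fertility rate at a given GDP per capita (in $k)."""
    return INTERCEPT + SLOPE * gdp_thousands


for gdp in (0, 1, 10, 30):
    print(f"GDP per capita ${gdp}k -> predicted fertility {predicted_fertility(gdp):.2f}")
```

The prediction at GDP = 0 is just the intercept. No country in the data has a GDP per capita of zero, which is why that value is an extrapolation.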

Another regression table: uncertainty in the estimates

1. Suppose we drew another sample of countries. Would we expect to see a GDP coefficient different from zero in that case?
   Yes.
2. Why?
   The se is small relative to β̂, so the true β is probably far from 0.
3. How likely is it that we would see a t statistic this large if β = 0?
   Very unlikely: less than 1 in 1000 samples.
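To get a feel for how small that probability is, the sketch below converts the reported t-statistic into a two-sided p-value. It assumes a standard Normal approximation to the t distribution, which should be reasonable with N = 130 but is not necessarily how the table's p-values were computed; the function name is hypothetical.

```python
import math


def two_sided_p_normal(t_stat):
    """Two-sided p-value under a standard Normal approximation (an assumption)."""
    # P(|Z| >= |t|) for Z ~ N(0, 1) equals erfc(|t| / sqrt(2)).
    return math.erfc(abs(t_stat) / math.sqrt(2))


t_gdp = 8.44  # magnitude of the t-statistic reported for GDP per capita
p_gdp = two_sided_p_normal(t_gdp)
print(f"Approximate p-value: {p_gdp:.3g}")
print("Below 0.001?", p_gdp < 0.001)
```

Note that dividing the rounded table entries (0.10 / 0.01) gives 10 rather than 8.44. The table's t-statistic was presumably computed from unrounded values, so the sketch uses the reported t-statistic directly.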

Another regression table

Regression of Fertility on GDP per capita

                                  95% Confidence Interval
Variable               Estimate   [Lower, Upper]
Intercept                4.13     [3.80, 4.46]
GDP per capita ($k)     -0.10     [-0.12, -0.08]

N      130
R^2    0.36
RMSE   1.35

1. What do these confidence intervals mean?
   In 95% of random samples, intervals constructed this way will contain the true βs.
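These intervals can be approximately reproduced from the estimates and standard errors in the earlier GDP table. The sketch below assumes a Normal critical value of 1.96 (close to the "±2 standard errors" rule of thumb); the slides do not say which critical value was used, and the function name is invented for illustration.

```python
Z_95 = 1.96  # Normal critical value for a 95% interval (assumed)


def confidence_interval(estimate, se, z=Z_95):
    """Return (lower, upper) bounds: estimate +/- z * se."""
    return estimate - z * se, estimate + z * se


# Estimates and standard errors from the GDP regression table.
intercept_ci = confidence_interval(4.13, 0.17)
slope_ci = confidence_interval(-0.10, 0.01)

print(f"Intercept: [{intercept_ci[0]:.2f}, {intercept_ci[1]:.2f}]")
print(f"GDP per capita ($k): [{slope_ci[0]:.2f}, {slope_ci[1]:.2f}]")
```

Because the table entries are rounded, a recomputed interval can differ from a published one in the last digit.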

Another regression table: goodness of fit

1. How much of the variance in Fertility does this model explain?
   36 percent (the R^2).
2. When using the model to predict fertility for a specific country, how much does it miss by on average?
   About 1.35 children per woman (the RMSE).
3. How many cases were used in this analysis?
   130 (the N).

Foreshadowing

How do we reconcile our two sets of results? Which model, if any, is right?
To solve this conundrum, we need multiple regression: a method for regressing a dependent variable on several independent variables at once.
Then, at last, we can say something about confounders.
Fortunately, all of today's concepts will carry over to multiple regression.

Important linear regression concepts

Regression coefficient: $\beta$
Estimate of regression coefficient: $\hat{\beta}$
Standard error of the estimate of the regression coefficient: $\mathrm{se}(\hat{\beta})$
Fitted values: $\hat{y}_i$
Regression errors: $\varepsilon_i$
Residuals: $\hat{\varepsilon}_i$
Coefficient of determination: $R^2$
Sum of squared errors (SSE): $\sum_{i=1}^{n} \hat{\varepsilon}_i^2$
Regression sum of squares (SSR): $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
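As a recap, the sketch below computes several of these quantities by hand for a tiny dataset. The data are invented purely for illustration and are not the course data. The total sum of squares (here `sst`) is a name introduced for this sketch; it is used to show that, for a least squares line with an intercept, SSE and SSR add up to the total variation in y, so that R^2 = SSR / SST.

```python
# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [6.8, 6.1, 4.9, 4.4, 3.1, 2.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least squares estimates of the slope and intercept.
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# Fitted values and residuals.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Sums of squares.
sse = sum(e ** 2 for e in residuals)          # sum of squared errors
ssr = sum((fi - ybar) ** 2 for fi in fitted)  # regression sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)       # total sum of squares (assumed name)

r_squared = ssr / sst

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"SSE = {sse:.3f}, SSR = {ssr:.3f}, SST = {sst:.3f}")
print(f"R^2 = {r_squared:.3f}")
```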