An Instrumental Variable Consistent Estimation Procedure to Overcome the Problem of Endogenous Variables in Multilevel Models

Similar documents
Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

EC352 Econometric Methods: Week 07

Instrumental Variables Estimation: An Introduction

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).

Understanding Uncertainty in School League Tables*

Problem set 2: understanding ordinary least squares regressions

Identifying Endogenous Peer Effects in the Spread of Obesity. Abstract

Module 14: Missing Data Concepts

INTRODUCTION TO ECONOMETRICS (EC212)

What is Multilevel Modelling Vs Fixed Effects. Will Cook Social Statistics

Final Exam - section 2. Thursday, December hours, 30 minutes

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

25. EXPLAINING VALIDITYAND RELIABILITY

Carrying out an Empirical Project

Issues Surrounding the Normalization and Standardisation of Skin Conductance Responses (SCRs).

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work?

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

Score Tests of Normality in Bivariate Probit Models

THE ROLE OF PSYCHOMETRIC ENTRANCE TEST IN ADMISSION PROCESSES FOR NON-SELECTIVE ACADEMIC DEPARTMENTS: STUDY CASE IN YEZREEL VALLEY COLLEGE

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Daniel Boduszek University of Huddersfield

Testing the Predictability of Consumption Growth: Evidence from China

12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

The moderating impact of temporal separation on the association between intention and physical activity: a meta-analysis

Propensity Score Analysis Shenyang Guo, Ph.D.

Exploring the Factors that Impact Injury Severity using Hierarchical Linear Modeling (HLM)

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

Households: the missing level of analysis in multilevel epidemiological studies- the case for multiple membership models

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Alcohol interventions in secondary and further education

Statistical reports Regression, 2010

Dichotomizing partial compliance and increased participant burden in factorial designs: the performance of four noncompliance methods

BOOTSTRAPPING CONFIDENCE LEVELS FOR HYPOTHESES ABOUT REGRESSION MODELS

(b) empirical power. IV: blinded IV: unblinded Regr: blinded Regr: unblinded α. empirical power

Multiple Regression Analysis

Multiple Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Simple Linear Regression One Categorical Independent Variable with Several Categories

August 29, Introduction and Overview

Lecture 4: Research Approaches

Hierarchical Linear Models: Applications to cross-cultural comparisons of school culture

New Mexico TEAM Professional Development Module: Autism

PLS structural Equation Modeling for Customer Satisfaction -Methodological and Application Issues-

Regression Discontinuity Analysis

Ordinary Least Squares Regression

Assessing Studies Based on Multiple Regression. Chapter 7. Michael Ash CPPA

NORTH SOUTH UNIVERSITY TUTORIAL 2

TEACHING REGRESSION WITH SIMULATION. John H. Walker. Statistics Department California Polytechnic State University San Luis Obispo, CA 93407, U.S.A.

This is a repository copy of Using Conditioning on Observed Choices to Retrieve Individual-Specific Attribute Processing Strategies.

10. LINEAR REGRESSION AND CORRELATION

Studying the effect of change on change : a different viewpoint

Analysis of TB prevalence surveys

MEA DISCUSSION PAPERS

Confirmatory Factor Analysis. Professor Patrick Sturgis

Economics 345 Applied Econometrics

BOOTSTRAPPING CONFIDENCE LEVELS FOR HYPOTHESES ABOUT QUADRATIC (U-SHAPED) REGRESSION MODELS

Introduction to Econometrics

WELCOME! Lecture 11 Thommy Perlinger

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

Version No. 7 Date: July Please send comments or suggestions on this glossary to

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine

Do children in private Schools learn more than in public Schools? Evidence from Mexico

Causal Methods for Observational Data Amanda Stevenson, University of Texas at Austin Population Research Center, Austin, TX

ESTUDIOS SOBRE LA ECONOMIA ESPAÑOLA

11/24/2017. Do not imply a cause-and-effect relationship

Multiple Linear Regression Analysis

REGRESSION MODELLING IN PREDICTING MILK PRODUCTION DEPENDING ON DAIRY BOVINE LIVESTOCK

multilevel modeling for social and personality psychology

Best on the Left or on the Right in a Likert Scale

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

More Than Meets The Eye Dickey, Heather Suzanne; Norwood, Patricia Fernandes; Watson, Verity; Zangelidis, Alexandros

System Justifying Motives Can Lead to Both the Acceptance and Rejection of Innate. Explanations for Group Differences

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

1. The Role of Sample Survey Design

Propensity Score Matching with Limited Overlap. Abstract

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Analysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board

Cambridge International AS & A Level Global Perspectives and Research. Component 4

6. Unusual and Influential Data

A framework for predicting item difficulty in reading tests

Matthew Skellern. Updated version posted 4 November 2014

From where does the content of a certain geo-communication come? semiotics in web-based geo-communication Brodersen, Lars

Research proposal. The impact of medical knowledge accumulation and diffusion on health: evidence from Medline and other N.I.H.

Assessing the Validity and Reliability of the Teacher Keys Effectiveness. System (TKES) and the Leader Keys Effectiveness System (LKES)

Correlation and Regression

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide

Chapter 5: Field experimental designs in agriculture

A Strategy for Handling Missing Data in the Longitudinal Study of Young People in England (LSYPE)

Technical Track Session IV Instrumental Variables

Modeling Sentiment with Ridge Regression

m 11 m.1 > m 12 m.2 risk for smokers risk for nonsmokers

How Much Should We Trust the World Values Survey Trust Question?

Anale. Seria Informatică. Vol. XVI fasc Annals. Computer Science Series. 16 th Tome 1 st Fasc. 2018

Test Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification

Media, Discussion and Attitudes Technical Appendix. 6 October 2015 BBC Media Action Andrea Scavo and Hana Rohan

C h a p t e r 1 1. Psychologists. John B. Nezlek

Validity and reliability of measurements

Transcription:

An Instrumental Variable Consistent Estimation Procedure to Overcome the Problem of Endogenous Variables in Multilevel Models Neil H Spencer University of Hertfordshire Antony Fielding University of Birmingham & Multilevel Models Project, Institute of Education, University of London The Problem of Endogeneity It is far from unusual for a multilevel model to contain a regressor that can be regarded as an endogenous variable. The term endogeneity as opposed to exogeneity is a familiar term in econometrics. Often it manifests itself by explanatory variable being subject to the same influences as the response variable. It is thus not exogenous in the model being fitted. More particularly it may mean that variables which are regarded as endogenous are not independent of the random effects in the model. In such circumstances a basic assumption of modelling is not met and obtaining consistent estimators of the parameters is not straightforward. Many standard multilevel procedures (e.g. Iterative Generalised Least Squares) rely on the independence of regressors and model disturbances for their consistency properties. Obviously in the case of the presence of endogenous explanatory variables this does not generally hold. We present here a modelling strategy based on instrumental variables and introduce a MlwiN macro that provides desired consistent estimators. We will consider an example in which we wish to use a multilevel model and where we suppose an explanatory variable may be endogenous. Fielding (1998) introduces a dataset drawn from children in primary schools of the City of Birmingham Local Education Authority. Data is available on a range of school and pupil characteristics but of prime importance are factors influencing the results of National Curriculum Key Stage 1 tests. In a simple model we may wish to relate one of these test results to gender and age of the child in months taking into account as controls the results of baseline tests carried out when the child entered reception classes in school. A model such as this may be used to examine the progress children are making in different schools. For pupil i in school j and where we have p baseline test results we may write a model which we label (1) as KS1TEST = β 0 + β GENDER 1 + β 2 AGE + p k = 1 BASELINE TEST(k) The term u in model (1) represents a random school effect and ε is a within school random pupil effect. The endogeneity in this model arises because the baseline tests may be supposed to be related to the random pupil effect through the existence of important unmeasured and unmeasurable influences acting at this lowest level of the hierarchy ( e.g. home circumstances). These influences are incorporated in the disturbance ε but may also influence baseline test performances. It is also possible that there are some influences that make the baseline tests related to the school effect + u j + ε

u j. The common influences may be such things as the locality in which the school is situated and from which the pupils generally come. Overcoming the Problem of Endogeneity Solutions to the problem of inconsistency caused by endogenous regressors, particularly when they are thought to relate to higher level effects such as u j, have been proposed by Kiviet ( 1995) and Rice et. al. ( 1998). Kiviet uses a bias corrected version of the least squares dummy variable estimator (LSDV). Rice use conditioned iterative generalised least squares ( CIGLS). The first of these approaches suffers from a problem that the bias correction applied may actually increase the bias in some circumstances. Neither approach can directly easily cope with the case where the level 1 (pupil ) random effect is correlated with regressors. It is this latter situation on which we mainly focus. In econometrics and other literature a frequently used method of overcoming such endogeneity problems in fixed effects models is to use instrumental variable techniques. We adapt these techniques to cover multilevel random effects models. The latter possibility has been mentioned briefly and independently by Rice et. al. (1998). Spencer ( 1997) also suggested such an approach for repeat testing in educational situations where explanatory variables are lagged versions of the response.. Here we operationalise an instrumental variable approach by constructing instruments for each of the endogenous variables ( baseline tests) in model (1) above. Separate multilevel models for each endogenous explanatory variable are constructed using regressors assumed exogenous and independent of the random part of model (1). Prediction estimates of the endogenous variables are then obtained from the fixed parts of the supplementary models. These predicted values, being independent of the random part of the model of interest (1), are used as instruments. Armed with data on the original set of regressors of model (1) and the set of instruments ( being the original regressor set with endogenous variables replaced by their instruments), we can conduct an instrumental variable estimation of the fixed effect parameters in model 1 ( see e.g. Bowden & Turkington (1984)). This provides us with consistent estimates of the fixed parameters but at this stage adequate estimates of their standard errors are not available. The next stage, then, is to obtain estimates of the random part of Model (1). This is done by using MlwiN procedures to create constraints on the fixed parameters. They are forced to be equal to those calculated from the instrumental variable procedure. The resulting estimates of the random part of the model can then be used to obtain standard errors of the instrumental variable fixed effects estimates. Mln and MlwiN macros called IV have been written to implement this procedure for obtaining consistent parameter estimates via instrumental variable estimation. They are available from the author's on request ( A.Fielding@bham.ac.uk, N.H.Spencer@herts.ac.uk ) or can be downloaded from the Birmingham web page www.bham.ac.uk/economics/staff/tony.htm. Preparatory work on the part of the user involves identifying which variables are to be endogenous and creating appropriate instruments for them. Model (1) or whatever is set up in the normal manner in MlwiN. The macros then need to have indicated to them which worksheet columns

the set of instrumental variables occupy. Following the set of simple instructions in the 'READ ME' file accompanying the macros does this. The Application We now use the example and data of Fielding ( 1998) discussed above to demonstrate the use and performance of the instrumental variable method embodied in the macros. The particular response variable used is the Mathematics Test level at Key Stage 1 standardised to have mean zero and unit variance. The seven baseline tests available in the data ( various forms of Mathematics and English tests) are inevitably highly correlated. To avoid potential multicollinearity and for heuristic purposes principal components are formed. The first principal component accounts for 60% of the variation in the data and none of the remaining components accounts for more than 8.5%. A decision was taken to use just the first principal component and this turns out to be a near equally weighted average of the seven original test scores. This variable was taken as endogenous and replaced the seven variables in model (1) above. Further, it was then modelled using a multilevel approach with intercept random effects for school and pupils with fixed effect explanatory dummy variables for the pupils ethnicity and first language. Excluded from use were some dummies that had been shown by Fielding (1998) to have a significant effect on pupil progress. Fielding also found that whether or not the pupil had been to a nursery school did have an impact on baseline performance but no significant effect on progress.. Thus a dummy for this variable was also used in forming the instrument.. The resultant model for the principal component of baseline scores was thus thought to provide predictions as an instrument, free of the problem of dependence on the disturbances of the original progress model of interest. Table 1 shows estimates of the fixed parameters ( and estimated standard errors)of the adapted model (1) obtained with and without the consistent instrumental variable estimation procedure ( IV). It is noticeable that the influence of gender and baseline testing decreases and that of age increases ( indeed almost doubles) when the consistent procedure is applied. Thus although the estimates from the two procedures are not radically different the interpretation of the results is affected by the choice of estimation method. Additionally using the instrumental variable technique gives us the added benefit of providing estimates which are consistent. Had we not used the consistent estimation procedure we would not have known the extent to which the potential problem of endogeneity was affecting the results that we were obtaining. Table 1: Results with and without Instrumental Variable Procedures Without IV With IV Coefficient for Estimate Est std err. Estimate Est. std. err. Intercept -0.0671 0.0520 0.0353 0.0611 Male gender 0.102 0.0244 0.0758 0.0335 dummy Centred age in 0.0145 0.00379 0.0281 0.00828 months Baseline Ist Principal 0.314 0.00775 0.211 0.0540

Component In Table 1 it is also noticeable that the estimated standard errors produced by the IV procedures are higher ( though not much higher ) than those produced without. A well known drawback of IV procedures is that if good instruments for the endogenous variables cannot be found, then the resulting estimates, although consistent, may be quite imprecise. In some cases standard errors can become so large so as to make results uninterpretable. This in turn has implications for good practice in data collection where modelling such as that described above intended. Sometimes information may not be collected on variables which may be important in this respect but are ignored because they are thought from theoretical and empirical considerations a priori to have no direct influence on a response The variable of interest. The relatively low standard errors for the IV estimators displayed in Table 1 may be noted. They were obtained because the instrument for the baseline principal component variable was constructed from a number of useful variables that were collected alongside the baseline results. These variables had no net direct effect on the response once other controls had been included and could not be part of unmeasured variation included in the random part of the model. Had these details of nursery attendance and ethnicity/first language not been available, the estimates of the first principle component used to construct its instrument would inevitably have been much cruder and larger standard errors than those in Table 1 might well have resulted. It is also possible that,if a much wider range of information relevant to baseline results had been available, the precision of the IV estimators might have been further improved. Experimentation with good instrument selection depends on the availability of such information. There is then a real lesson to be learned for study design. It is important when studies are planned, with ends similar to the above in view, that allowance is made for the necessity to have sufficient ( and possibly relevant ) background variables to use when constructing instruments. Conclusions The problem of inconsistency caused by the presence of endogenous variables in a multilevel model has been identified and a solution suggested by use of adaptations of instrumental variable procedures. The implementation of the consistent estimation method suggested has been made possible using the flexible macro facilities of MlwiN. These macros, available from the authors, make the implementation as a MlwiN procedure a fairly routine operation. An illustration of the method in action has been presented and the results contrasted with those produced when the problem of heterogeneity is ignored. Due to the availability of some reasonable variables with which to form instruments the IV procedure produced satisfactory efficient estimates. The importance of sound planning in decisions on data needs have been emphasised. References Bowden, R.J. and Turkington, D. A. (1984). Instrumental Variables, Cambridge, England, CUP Fielding, A. (1998). Why Use Arbitrary Points Scores? Ordered Categories in Models

of Educational Progress. Discussion Paper 98-23, Department of Economics, University of Birmingham Kiviet, J. F. (1995). On Bias, Inconsistency and Efficiency of Various Estimators in Dynamic Panel Data Models. Journal of Econometrics, 68, 53-78 Rice, N.., Jones, A. and Goldstein, H. (1998). Multilevel Models where the Random Effects are Correlated with the Fixed Predictors: A Conditioned Iterative Generalised Least Squares Estimator ( CIGLS). Multilevel Modelling Newsletter, 10, 1, 7-11 Spencer, N. H.. ( 1998) Consistent Parameter Estimation for Lagged Multuilevel Models, University of Hertfordshire Business School Technical Report 1, UHBS 1998:19