Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity. s.n.

University of Groningen
Latent instrumental variables
Ebbes, P.
Document version: Publisher's PDF, also known as Version of record
Publication date: 2004

Chapter 1 Introduction

In this thesis we propose a new method to estimate regression coefficients in linear regression models where regressor-error correlations are likely to be present. This method, the Latent Instrumental Variables (LIV) method, utilizes a discrete latent variable model that accounts for dependencies between regressors and the error term. As a result, observed exogenous instrumental variables are not required. In the following chapters we introduce the LIV method and illustrate it on both simulated data and empirical applications. We show that the LIV method has desirable properties compared with existing methods, such as ordinary regression and instrumental variables methods, when regressor-error dependencies are present. Each chapter is more or less self-contained and based on articles. In the following we present the scope and outline of the thesis.

The starting point of this research is the simple linear regression model given by

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \tag{1.1}$$

where $y_i$ is the dependent variable, $x_i$ the explanatory variable (regressor), and $\varepsilon_i$ is the error term or disturbance with mean zero and variance $\sigma^2$, all independent. The regression parameters $\beta_0$ and $\beta_1$ are the objects of inference. We focus on a situation where the regressor is random and possibly correlated with the disturbance,¹ in which case it is not exogenous but endogenous. Regressor-error correlations may be the result of several causes and arise in a wide variety of models, e.g. when relevant explanatory variables are omitted, when the dependent variable influences the explanatory variable (simultaneity), when the sampling process is non-random (self-selection), or when the explanatory variable is measured with error.

The standard inferential methods are invalid if regressor-error dependencies exist. For instance, the ordinary least squares (OLS) estimator for the regression parameters $(\beta_0, \beta_1)$ is inconsistent, in which case the true effect of the explanatory variable on the dependent variable is systematically over- or underestimated, leading to false conclusions and erroneous decision making. Instrumental variables (IV) methods were developed to overcome these problems and have a long history in econometrics (Bowden and Turkington, 1984; Greene, 2000; Judge et al., 1985). Instruments $z$ are variables that mimic the endogenous regressor $x$ as well as possible, but are uncorrelated² with the error term $\varepsilon$. Once valid instruments are available, the regression parameters can be consistently estimated via, for instance, two-stage least squares techniques.

However, finding exogenous instruments is hard work, and empirical researchers are often confronted with weak instruments. An instrument is weak when it correlates only weakly with the endogenous regressors. If instruments are weak and/or not exogenous, the standard instrumental variables estimation and inferential procedures are inaccurate and produce poor results that are potentially worse than simply ignoring the endogeneity problem and relying on biased ordinary least squares. Hence, small biases in ordinary least squares estimates can become large biases when invalid instruments are used (Stock, Wright, and Yogo, 2002, or Hahn and Hausman, 2003).³
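The inconsistency of ordinary least squares under model (1.1), and its repair by two-stage least squares when a valid instrument exists, can be illustrated with a small simulation. The sketch below is not taken from the thesis: the data-generating process, the coefficient values, the sample size, and the helper name `ols` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0  # assumed true parameters of model (1.1)

# Assumed data-generating process: z is a valid instrument, and the shared
# component u makes the regressor x correlated with the error term eps.
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
eps = u + rng.normal(size=n)          # cov(x, eps) = var(u) = 1
y = beta0 + beta1 * x + eps


def ols(X, y):
    """Least squares coefficients for design matrix X (hypothetical helper)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


ones = np.ones(n)

# Ordinary least squares of y on x: inconsistent because cov(x, eps) != 0.
b_ols = ols(np.column_stack([ones, x]), y)

# Two-stage least squares: first regress x on z, then regress y on fitted x.
Z = np.column_stack([ones, z])
x_hat = Z @ ols(Z, x)
b_2sls = ols(np.column_stack([ones, x_hat]), y)

print("OLS slope: ", b_ols[1])
print("2SLS slope:", b_2sls[1])
```

Under these assumed settings the OLS slope converges to $\beta_1 + \mathrm{cov}(x, \varepsilon)/\mathrm{var}(x) = 2 + 1/2.64 \approx 2.38$, whereas the two-stage least squares slope converges to the true value 2. Shrinking the coefficient 0.8 toward zero turns `z` into a weak instrument and makes the second-stage estimate much less stable.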
Besides the problems of potentially weak and/or endogenous instruments, these variables may simply not be available to a researcher, whereas collecting them is time consuming and expensive. The main purpose of this research is to develop a new method, the latent instrumental variables (LIV) method, that does not require observed instrumental variables. As such, the difficult task of finding instruments and the inferential issues in the presence of poor-quality instruments are circumvented. In fact, the optimal LIV instruments are estimated as a by-product from the available data.

¹ At least in the weak sense that $\operatorname{plim} \frac{1}{n} \sum_i x_i \varepsilon_i \neq 0$, or that $E(x_i \varepsilon_i) \neq 0$, implying $E(\varepsilon_i \mid x_i) \neq 0$; see e.g. White (2001) or Ferguson (1996).
² The instrument is said to be exogenous.
³ This was already observed by Sargan in the 1950s; see e.g. Arellano (2002).

The problems surrounding instrumental variables estimation discussed above are considered in greater detail in chapter 2. The literature review presented in that chapter covers most of the recent studies on weak instruments and contains several references to empirical research (labor economics, marketing, industrial economics) that aims at solving regressor-error dependencies. Furthermore, we point out a few alternative approaches to instrumental variables estimation that may be useful in solving regressor-error dependencies. This overview of the literature is a selection of issues that motivates the development of the LIV method. We conclude chapter 2 by highlighting the relevance and contribution of this research.

In chapter 3 we introduce the latent instrumental variable (LIV) model. It solves regressor-error correlations in linear models by postulating that the instrumental variable is discrete and latent. As a by-product, the method allows for testing for endogeneity without requiring access to observable instruments. Our simulation results show that the LIV method yields consistent estimates for the model parameters without observable instrumental variables. These results are superior to OLS estimates, which are biased when the regressors are not exogenous. The proposed test statistic for exogeneity is shown to have reasonable power throughout a wide range of settings. Furthermore, we prove identifiability of all model parameters.
We apply the LIV method to an empirical measurement error application where a laboratory dummy instrumental variable is available. We show that the predicted LIV dummy instrument is identical to this observed laboratory instrument. Hence, the LIV estimate for the regression parameter, obtained without using the observed instrument, is identical to the classical IV estimate that does require the existence of an observed instrument. We conclude that our instrument-free approach can be successfully used to estimate regression parameters in the presence of regressor-error correlations, and to test for this dependency without the necessity of first finding valid instruments.

The method proposed in chapter 3 is extended in chapter 4 to more general settings. We extend the model to a situation where several exogenous regressors are available. Furthermore, we allow for the possibility that observed instrumental variables are available. Using techniques similar to those for the simpler LIV model, we prove that all model parameters can be identified. Importantly, from this proof it follows that the general LIV model is still identified, even when possible observed instruments have no or very small effects on the endogenous regressor. In such a case, the classical IV model is unidentified or weakly identified, respectively. This identifiability result suggests a straightforward approach to examine instrument weakness that is based on existing testing principles. Furthermore, using similar reasoning, it suggests a straightforward test of instrument exogeneity (validity). To the best of our knowledge, such tests to independently investigate instrument exogeneity and weakness for each instrument have not appeared in the literature before. We illustrate both tests by means of a simulation example and show that the proposed tests have reasonable power under a variety of settings.

In addition, we propose several diagnostics to complete an LIV analysis. We propose several statistics to choose the number of categories of the discrete LIV instrument. Furthermore, we examine the robustness of the LIV estimates to misspecification of the likelihood equation and suggest how to examine residuals. We adapt standard methods from regression models to detect outliers and influential observations.
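The idea that a discrete grouping can act as an instrument may be made concrete with a simulation. The sketch below is not the LIV estimator: the LIV method does not observe the grouping and estimates it from the data, a step that is not implemented here. Instead, the code uses the true group dummy as an oracle instrument, comparable to the observed laboratory dummy in the measurement error application. The two-group structure, the group means `pi`, the error covariance, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1 = 1.0, 2.0  # assumed true regression parameters

# Assumed binary grouping: it shifts the mean of x but is independent of the
# errors. In an LIV analysis this grouping would be latent (unobserved).
group = rng.integers(0, 2, size=n)
pi = np.array([0.0, 3.0])  # assumed group means of the regressor

# Correlated errors (eps, v) create the regressor-error dependency.
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
eps, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

x = pi[group] + v
y = beta0 + beta1 * x + eps

# OLS slope: inconsistent because cov(x, eps) = cov(v, eps) = 0.6.
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Oracle IV (Wald) estimator with the true group dummy as the instrument:
# the ratio of the between-group differences in the means of y and x.
in_g1 = group == 1
b_iv = (y[in_g1].mean() - y[~in_g1].mean()) / (x[in_g1].mean() - x[~in_g1].mean())

print("OLS slope:      ", b_ols)
print("Oracle IV slope:", b_iv)
```

With these assumed values the OLS slope converges to $2 + 0.6/3.25 \approx 2.18$, while the oracle IV slope converges to 2. The sketch only shows why a discrete grouping that is unrelated to the error term carries the information needed for consistent estimation; recovering that grouping without observing it is the contribution of the LIV model itself.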
The proposed LIV model, tests, and diagnostics are applied in chapter 5. We examine the effect of education on income, where the variable education is potentially endogenous due to omitted ability or other causes. We review part of the schooling literature and discuss the problems associated with classical instrumental variables estimation. As will become clear, the classical IV method has produced a less than satisfactory solution in estimating the return to education. Importantly, researchers who use different sets of instruments arrive at different conclusions in terms of size and magnitude of the bias found in the OLS estimate for the return to education.

We examine three empirical datasets. In all three applications, we find an upward bias in the OLS estimates of approximately 7%. Our conclusions agree closely with recent results obtained in studies with twins that find an upward bias in OLS of about 10% (Card, 1999). Diagnostic evaluations demonstrate that the LIV method provides a satisfactory fit of the data. We also find that for each of the three datasets the classical IV estimates for the return to education point to biases in OLS that are not consistent in terms of size and magnitude. The proposed diagnostics and tests to examine the validity of available observed instruments indicate that in two of the three datasets the instruments used are potentially weak and/or endogenous. Our conclusion is that LIV estimates are preferable to the classical OLS and IV estimates in understanding the effects of education on income.

In chapter 6 we consider endogeneity problems in multilevel models, i.e. when data have a hierarchical structure. As before, the explanatory variables are assumed to be independent of the random components at the various levels. However, in many applications this is an unrealistic assumption. When the same cross-section units are observed over time, for instance, or when data on siblings or twins are available, multilevel models may in fact be used to solve regressor-error correlations at a lower level. In this chapter we show that much care is required in relying on these methods in actual applications. First, we review methods that can be used to test for different types of dependencies between random effects and regressors.
Secondly, we present results from Monte Carlo studies designed to investigate the performance of these methods, and, finally, we discuss estimation methods that can be used when some, but not all, of the assumptions of independence between random effects and regressors are violated. Because current methods are limited in various ways, we also present a list of open problems and suggest solutions for some of them. As we will show, the issue of independence between regressors and random effects has received some attention in the econometrics literature, but this important work has had little impact on current research practices in the social and behavioral sciences.

In chapter 7 we take parts of the results of chapter 6 a step further and develop nonparametric Bayesian methods (Dey, Müller and Sinha, 1998) to solve regressor-error dependencies in multilevel models at various stages of the model. This method solves some of the problems addressed in chapter 6 and is a generalization of the standard LIV model in the sense that we do not impose restrictions (discreteness) on the distribution of the instruments. In fact, we let the data determine the best distribution. This is an important advantage, as it does not require an a priori specification of the right number of categories of the unobserved discrete instrument. Because we take full advantage of Bayesian estimation methods, the proposed model can readily be adapted and extended to more general and more complex model structures. Furthermore, insight into small sample properties of the estimation results is more easily obtained, and inference does not rely on asymptotic results. This chapter is still work in progress and the results are preliminary, yet promising. We illustrate the potential usefulness of this approach to regressor-error dependencies and suggest steps for further research.

In chapter 8 we present a discussion of the proposed LIV method and the results found. Furthermore, we present future research directions.