Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT
Amin Mousavi


Centre for Research in Applied Measurement and Evaluation, University of Alberta
Paper presented at the 2013 annual meeting of the National Council on Measurement in Education, San Francisco, California, USA, April 26-30

Introduction

During the last two decades, there has been an increased focus worldwide on educational quality and learning outcomes. International educational surveys such as TIMSS, PIRLS and PISA provide a basis for assessing educational quality at different levels, along with the factors that influence it. Using the information provided by such large-scale assessments, participating countries can explore the strengths and weaknesses of their educational systems in comparison with other participating countries.

The International Association for the Evaluation of Educational Achievement (IEA) administers TIMSS, which assesses mathematics and science, and PIRLS, which assesses reading literacy. IEA's Trends in International Mathematics and Science Study (TIMSS) provides useful information about students' mathematics and science achievement in an international context. TIMSS assesses students at the fourth and eighth grades and also collects a wealth of data from principals and teachers about curriculum and instruction in mathematics and science. In addition, TIMSS Advanced is designed to assess school-leaving students with special preparation in advanced mathematics and physics. Countries participating in TIMSS Advanced want internationally comparative data about the achievement of their students enrolled in advanced courses designed to lead into science-oriented university programs. TIMSS uses the curriculum, broadly defined, as the major organizing concept in considering how educational opportunities are provided to students and the factors that influence how students use these opportunities.

Large-scale assessments like TIMSS achieve broad coverage of the targeted content domain by dividing the item pool into blocks or clusters of items. Each student responds to one or more of these blocks and so receives only a subset of the total item pool.
Under this design, each student responds to only a portion of the whole assessment in the form of a booklet. The test booklets are partially linked through blocks that occur in multiple booklets. Because each student responds to a relatively small number of items, the accuracy of measurement at the individual level is considerably lower than it would be if students were administered the full test. Common approaches to estimating individual proficiency, such as marginal maximum likelihood (MML) and expected a posteriori (EAP) estimates, are optimal for individual students but not for classes and schools. These approaches result in biased estimates of group-level results (von Davier et al., 2009). One way of solving this problem and obtaining accurate group-level estimates is to use multiple values representing the expected distribution of a student's ability. These so-called plausible values (PVs) allow unbiased estimation of the plausible range and the location of proficiency for groups of students. Plausible values are based on student responses to the subset of items they receive, as well as on other relevant and available background information (Mislevy, 1991). They can be viewed as a set of estimates generated using multiple imputation. Plausible values are not individual scores in the traditional sense and should therefore not be analyzed as multiple indicators of the same score or latent variable (Mislevy, 1993). There are at least two approaches to analyzing data from large-scale assessments, which are briefly described in the next two parts.

Hierarchical Linear Modeling (HLM)

In social and behavioural research, we often work with data sets in which individuals are nested within units or groups, and this grouping can affect individuals' scores. For instance, students are nested within classes or schools, and properties of the class or school, such as differences among teachers or school resources, can affect students' achievement. To analyze this kind of nested data, a researcher has to take the structure of the data into account in order to extract more meaningful and accurate information about individuals. A generally accepted statistical method for analyzing nested data is hierarchical linear modeling, or multilevel modeling. The use of multilevel models (Goldstein, 1995), also called hierarchical linear models (Raudenbush & Bryk, 2002; Snijders & Bosker, 2012), takes into account the nested structure of the data.
Briefly, multilevel models describe relationships between a dependent variable and a set of explanatory variables while taking the hierarchy of the data into account. The relative variation in the outcome measure (dependent variable) between students within the same school and between schools can therefore be evaluated. Multilevel models are used to make inferences about the relationships among explanatory and outcome variables at different levels. Since data from large-scale tests like TIMSS are collected in such a way that students are nested within classes, schools and countries, analyzing data obtained from educational surveys requires multilevel modeling. The outcome variables in most large-scale tests, such as TIMSS, PISA and PIRLS, usually take the form of plausible values. The common practice is to analyze the plausible values together with the explanatory variables using the multiple imputation method in order to obtain the final model estimates. In this approach, test items are not directly used as indicators of individual performance in the multilevel model. An alternative is to use the test items as indicators of individual performance within a multilevel framework. Fox and Glas (2001, 2003) proposed a method for multilevel modeling that takes into account the uncertainty of performance indicators at different levels by using Item Response Theory (IRT) models in a multilevel framework, known as Multilevel Item Response Theory (MLIRT).

Multilevel Item Response Theory (MLIRT)

In most educational research, measurements are needed at the individual and group levels. Students' examination results can be regarded as indicators of their abilities, and they are measured with error. In summary, observed test data can be considered indicators of latent variables (i.e., ability), and these latent variables can be integrated into a multilevel model. A multilevel IRT model extends traditional IRT models so that they account for variation in ability between groups, such as schools or classes, as well as within groups. Hence, a multilevel IRT model distinguishes individual-level abilities from group-level abilities. For example, a multilevel extension of the two-parameter logistic IRT model for dichotomously scored items can be written as

$$P_{ipg} = \frac{\exp[a_i(\theta_{pg} - b_i)]}{1 + \exp[a_i(\theta_{pg} - b_i)]} \qquad (1)$$

where $P_{ipg}$ is the probability of a correct answer to the $i$-th item by person $p$ in group $g$; $a_i$ and $b_i$ are the discrimination and difficulty parameters of the $i$-th item, respectively; and $\theta_{pg} = \xi_g + \zeta_{pg}$. Here, $\theta_{pg}$ is the ability of person $p$ in group $g$, $\xi_g$ is the mean ability of group $g$, and $\zeta_{pg}$ is the deviation of person $p$ from the group mean ability.
This is one of the simplest forms of a multilevel IRT model. Typical applications of multilevel IRT models, however, include explanatory variables (Kamata & Vaughn, 2011). Fox and Glas (2001) extended this idea to multilevel linear modeling with the two-parameter normal ogive and graded response models as the measurement models. The multilevel model is embedded in the IRT framework to model the relationship between observed individual and group characteristics and an outcome variable measured by dichotomous or polytomous items. Let $y_{ijk}$ denote the observed response of the $i$-th student in the $j$-th school to item $k$. A two-parameter IRT model relates the observed dichotomous item response to the student's latent ability, $\theta_{ij}$, that is,

$$P(y_{ijk} = 1 \mid \theta_{ij}, a_k, b_k) = \Phi(a_k \theta_{ij} - b_k) \qquad (2)$$

where $\Phi$ is the standard normal cumulative distribution function, and $a_k$ and $b_k$ are the discrimination and difficulty parameters of item $k$, respectively. The latent ability can be expressed at level 1 of a multilevel model as a linear combination of level-1 predictors:

$$\theta_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + \dots + \beta_{Qj} X_{Qij} + e_{ij} \qquad (3)$$

in which $X_{qij}$ ($q = 1, \dots, Q$) are the explanatory variables at level 1. At level 2, the intercept and each regression coefficient can be expressed as a linear combination of level-2 predictors:

$$\beta_{qj} = \gamma_{q0} + \gamma_{q1} W_{1qj} + \dots + \gamma_{qS} W_{Sqj} + u_{qj} \qquad (4)$$

where $W_{sqj}$ is the $s$-th explanatory variable at level 2. Both residuals, $e_{ij}$ and $u_{qj}$, are assumed to be independently and normally distributed.

The above equations define a multilevel IRT model with a latent dependent variable measured by a two-parameter IRT model. Because a latent variable is used as the dependent variable, the IRT model can be seen as a level within the multilevel model. A multilevel IRT model can also consist of a multilevel model that defines the relations among different latent variables, together with various IRT models for measuring those latent variables. In summary, MLIRT is a method for incorporating IRT ability estimates, obtained with the normal ogive model, as the outcome variable in a multilevel model. Traditional multilevel models assume that the variables in the model are measured without error, which can lead to biased estimates of the multilevel model parameters, whereas MLIRT is designed to handle measurement error in the explanatory variables (Fox, 2004). This means that the uncertainty in the measurements of the latent variables is taken into account in the estimation of the other model parameters.
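To make the structure of equations (2) to (4) concrete, the following is a minimal simulation sketch of the simplest case: a random-intercept model with no explanatory variables and a normal ogive measurement model. The function name, argument names and all numeric values are hypothetical and chosen only for illustration; this is not the estimation routine of the mlirt package.

```python
import random
from statistics import NormalDist


def simulate_mlirt(n_schools, n_students, a, b, gamma00, tau, sigma2, seed=0):
    """Simulate binary responses from a random-intercept normal ogive model.

    Level 2:      beta_0j  = gamma00 + u_0j,  u_0j ~ N(0, tau)
    Level 1:      theta_ij = beta_0j + e_ij,  e_ij ~ N(0, sigma2)
    Measurement:  P(y_ijk = 1) = Phi(a_k * theta_ij - b_k)
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal CDF
    data = []
    for j in range(n_schools):
        beta0j = gamma00 + rng.gauss(0.0, tau ** 0.5)
        for _ in range(n_students):
            theta = beta0j + rng.gauss(0.0, sigma2 ** 0.5)
            responses = [
                1 if rng.random() < phi(a_k * theta - b_k) else 0
                for a_k, b_k in zip(a, b)
            ]
            data.append((j, theta, responses))
    return data


# Hypothetical item parameters and variance components (not from the paper)
data = simulate_mlirt(
    n_schools=10, n_students=20,
    a=[1.0, 0.8, 1.2], b=[0.0, -0.5, 0.5],
    gamma00=0.0, tau=0.2, sigma2=0.8,
)
```

Each record holds the school index, the simulated latent ability and the item responses, so the nesting of students within schools and the separation of the measurement model from the structural model are both visible in the data layout.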
Usually, individual abilities or group characteristics are estimated and imputed in the multilevel analysis, and the measurements are assumed to be observed without error. As a result, the estimated regression coefficients are biased and their standard deviations are too small (Fox, 2004). Standard multilevel software (e.g., MLwiN, HLM) cannot estimate all parameters of a multilevel IRT model simultaneously; therefore, a free package called mlirt was developed for the R environment (R Development Core Team, 2012) to estimate MLIRT models (Fox, 2003). This package can be used to analyze dichotomous or polytomous responses, and the method is applicable to data from large-scale assessments like TIMSS. Fox (2007) used the mlirt package to analyze data from PISA 2003 and compared the results with those of the HLM 5 program (Raudenbush et al., 2000). He showed that the estimates from mlirt and HLM are close, but that the variance components estimated with MLIRT are greater than those estimated with HLM using plausible values, which is attributed to mlirt taking into account the measurement error of the explanatory variables. The aim of this study is therefore to replicate Fox's study using TIMSS Advanced 2008 data in order to further investigate the MLIRT method proposed by Fox and Glas.

Methodology

Data/participants: Data from TIMSS Advanced 2008 mathematics (IEA, 2008) for Iran were used to compare the two procedures. The data set contained 2,362 students (60.6% male, coded as 0; 39.4% female, coded as 1) nested within 116 schools.

Achievement test and questionnaires: The advanced mathematics assessment framework for TIMSS Advanced 2008 was organized around two dimensions: a content dimension specifying the subject matter to be assessed within mathematics (i.e., algebra, calculus, and geometry) and a cognitive dimension specifying the thinking processes to be assessed (i.e., knowing, applying, and reasoning). The items were included in four linked booklets (Arora et al., 2009). Although both dichotomously and polytomously scored items were included in the assessment, only the dichotomous items were used in the present study.
The data for the explanatory variables were obtained from the student, teacher and school questionnaires.

Multilevel modeling: Three HLM models were considered in this study:

1) Null model (M0): a model without any explanatory variables. It can be written as

Level 1: $Y_{ij} = \beta_{0j} + e_{ij}$
Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$

2) Model 1 (M1): a random intercept model with two explanatory variables at level 1, attitude towards math (AM) and student's gender (SEX). It can be written as

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j}(\mathrm{AM}_{ij}) + \beta_{2j}(\mathrm{SEX}_{ij}) + e_{ij}$
Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$, $\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$

3) Model 2 (M2): a random intercept model with the same two level-1 explanatory variables, attitude towards math (AM) and student's gender (SEX), and one level-2 explanatory variable, school resources for teaching math (RESOURCES). It can be written as

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j}(\mathrm{AM}_{ij}) + \beta_{2j}(\mathrm{SEX}_{ij}) + e_{ij}$
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01}(\mathrm{RESOURCES}_j) + u_{0j}$, $\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$

All the explanatory variables were assumed to have fixed effects across level-2 units. For the purpose of comparison, the sum of raw scores on the mathematics test was considered as an outcome variable in HLM in addition to the plausible values. To make the results from the two methods (i.e., MLIRT and HLM) comparable, the outcome variables entered into HLM were standardized to a mean of 0 and a standard deviation of 1, because the mlirt package assumes that ability estimates have a mean of 0 and a standard deviation of 1.

Software/Estimation method: The traditional multilevel models were analyzed with the HLM 6 program (Raudenbush, Bryk, Cheong, & Congdon, 2004), using plausible values and the standardized sum of raw scores as outcome variables. In HLM, plausible values are analyzed by multiple imputation. There were five plausible values for each student in the data set, and HLM performs the multilevel analysis of a given model separately for each plausible value. The final estimate of each parameter was computed in the following steps (Snijders & Bosker, 2012). First, the average estimate over the multiple imputations is

$$\bar{\psi} = \frac{1}{M} \sum_{m=1}^{M} \hat{\psi}_m \qquad (5)$$

where $M$ is the number of imputations (here, $M = 5$) and $\hat{\psi}_m$ is the estimate obtained from the $m$-th imputation. Then the average within-data-set variance and the between-imputation variance are obtained as

$$\bar{W} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{S.E.}(\hat{\psi}_m)^2 \qquad (6)$$

and

$$B = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{\psi}_m - \bar{\psi})^2 \qquad (7)$$

Then the standard error of the combined estimate is

$$\mathrm{S.E.}(\bar{\psi}) = \sqrt{\bar{W} + \left(1 + \frac{1}{M}\right) B} \qquad (8)$$

The mlirt 2.0 package developed by Fox (2010) for the R environment (R Development Core Team, 2012) was used to estimate the multilevel item response theory models. This package uses a Markov chain Monte Carlo (MCMC) algorithm for parameter estimation within a Bayesian framework. Common normal priors are specified for the item parameters, $(a_k, b_k) \sim N(\mu, \Sigma)$. A Gibbs sampler is used to simulate draws from the conditional posterior distributions for binary responses. The estimation method in the mlirt package is complex and computationally intensive; readers can refer to Fox (2007) for details.

Results

The parameter estimates for the null model (M0) obtained from the two methods are reported in Table 1. In this table, $\gamma_{00}$ is the level-2 intercept, $\sigma^2$ is the variance of the outcome variable at level 1, $\tau_{00}$ is the variance of the outcome variable among schools, and $\rho$ is the estimated intraclass correlation.
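Returning briefly to the pooling of plausible-value results in equations (5) to (8), the steps can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, the input values are made up rather than taken from this study, and HLM performs the equivalent computation internally.

```python
import math


def pool_plausible_values(estimates, standard_errors):
    """Combine per-plausible-value results following equations (5)-(8)."""
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need two or more estimates with matching standard errors")
    pooled = sum(estimates) / m                                    # eq. (5)
    within = sum(se ** 2 for se in standard_errors) / m            # eq. (6)
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # eq. (7)
    pooled_se = math.sqrt(within + (1 + 1 / m) * between)          # eq. (8)
    return pooled, pooled_se


# Hypothetical coefficient estimates from five plausible values
est = [0.10, 0.12, 0.08, 0.11, 0.09]
ses = [0.05, 0.05, 0.05, 0.05, 0.05]
pooled, pooled_se = pool_plausible_values(est, ses)
print(pooled, pooled_se)
```

The pooled standard error exceeds the average per-imputation standard error whenever the estimates differ across plausible values, which is how the imputation uncertainty enters the final inference.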

Table 1: Estimates of model parameters for the null model (M0)
Columns: HLM-PV (Estimate, S.E.) | HLM-Raw (Estimate, S.E.) | MLIRT (Estimate, S.E.)
Fixed part
  (INTERCEPT)
Random part
  * * *
* significant at the 0.05 level

For the null model, the results from MLIRT are quite different from those of the traditional multilevel model. In the fixed part, all estimates are close to zero, which is due to the standardization of the outcome variable for HLM and the scale of the ability estimates in MLIRT. In the random part, the $\sigma^2$ estimate based on plausible values is smaller than the estimate based on raw scores, whereas the MLIRT estimate is higher than both. This was expected because MLIRT also takes into account the measurement error of the explanatory variables. The $\tau$ estimate based on raw scores is about half of the estimate based on plausible values, indicating that more variation between schools is observed when plausible values are used; both are higher than the MLIRT estimate. Finally, the intraclass correlation coefficients for the plausible values and raw scores indicate that about 49% and 26.5% of the variation in the outcome variable, respectively, is due to the nesting of students within schools. This share drops sharply to 5.2% for the MLIRT model.

Table 2 shows the parameter estimates for Model 1 (M1) obtained from the two methods. In this table, $\gamma_{10}$ and $\gamma_{20}$ are the level-2 intercepts of the corresponding level-1 explanatory variables.
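The intraclass correlations discussed for the null model follow from the two variance components through the standard definition: the between-school variance divided by the total variance. A minimal sketch, with a hypothetical function name and made-up variance components rather than the values of Table 1:

```python
def intraclass_correlation(tau, sigma2):
    """Share of total variance lying between schools: tau / (tau + sigma2)."""
    total = tau + sigma2
    if total <= 0:
        raise ValueError("total variance must be positive")
    return tau / total


# Hypothetical variance components, not taken from the paper
rho = intraclass_correlation(tau=0.25, sigma2=0.75)
print(rho)
```

Because the ratio depends on both components, a larger level-1 variance alone is enough to lower the intraclass correlation even when the between-school variance is unchanged.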

Table 2: Estimates of model parameters for Model 1 (M1)
Columns: HLM-PV (Estimate, S.E.) | HLM-Raw (Estimate, S.E.) | MLIRT (Estimate, S.E.)
Fixed part
  (INTERCEPT)
  (AM)
  (SEX) * * *
Random part
* significant at the 0.05 level

The results reported in Table 2 show that the explained variance using HLM ranges from 24.8% for raw scores to 47.8% for plausible values, whereas the explained variance using MLIRT is just 4.1%. In the fixed part, AM is not significant (p > 0.05) for any method or outcome variable, whereas SEX is significant (p < 0.05) for all methods and outcome variables. The negative sign of the SEX estimate indicates that males outperformed females. The pseudo R-squared values, based on the formula provided by Snijders & Bosker (2012), after adding the two explanatory variables to the null model for plausible values, raw scores and MLIRT are 0.026, and 0.006, respectively.

Table 3 shows the parameter estimates for Model 2 (M2) for the two methods. In this table, $\gamma_{01}$ is the regression coefficient of RESOURCES at level 2.
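For reference, the pseudo R-squared used for M1 compares the total unexplained variance of a random intercept model with that of the null model. One form given by Snijders and Bosker for the level-1 proportional reduction in prediction error is shown below; the paper does not say which of their formulas was applied, so treating this as the one used is an assumption, and the subscripts M0 and M1 are notation introduced here.

```latex
R_1^2 \;=\; 1 - \frac{\hat{\sigma}^2_{M1} + \hat{\tau}_{00,\,M1}}
                     {\hat{\sigma}^2_{M0} + \hat{\tau}_{00,\,M0}}
```

Under this definition, values near zero mean that adding AM and SEX leaves the sum of the level-1 and level-2 variance estimates almost unchanged relative to the null model.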

Table 3: Estimates of model parameters for Model 2 (M2)
Columns: HLM-PV (Estimate, S.E.) | HLM-Raw (Estimate, S.E.) | MLIRT (Estimate, S.E.)
Fixed part
  (INTERCEPT)
  Student
    (AM)
    (SEX) * * *
  School
    (RESOURCES) * * *
Random part
  * * *
* significant at the 0.05 level

From Table 3, it can be seen that the intraclass correlation coefficients are smaller than those for M0 and M1. The amounts of explained variance for plausible values and raw scores are 45.8% and 22.4%, respectively, and 3.8% for MLIRT. In the fixed part, RESOURCES and SEX are significant (p < 0.05) across methods and outcome variables.

Discussion and Conclusion

The observed difference between MLIRT and HLM could be due to the different outcome variables. Although the results from plausible values and raw scores are not very close to each other, both differ from the MLIRT results. This could be due to the distribution of the estimated thetas from MLIRT, as shown in Figure 1, which presents the probability density functions of the outcome variables used in HLM and MLIRT, obtained by kernel density estimation (Silverman, 1986). The mean of the plausible values is used in the figure only for illustration, to avoid the clutter of plotting five plausible values in one graph.
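A kernel density estimate of this kind can be sketched with the Python standard library alone. The Gaussian kernel and the rule-of-thumb bandwidth below are assumptions, since the paper does not report which kernel or bandwidth was used, and the bimodal sample is simulated purely for illustration.

```python
import math
import random
import statistics


def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(x)
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def gaussian_kde(x, grid, h=None):
    """Evaluate a Gaussian kernel density estimate of x at each grid point."""
    h = silverman_bandwidth(x) if h is None else h
    norm = 1.0 / (len(x) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in x)
        for g in grid
    ]


# Hypothetical bimodal "theta" sample, not the study data
rng = random.Random(1)
sample = [rng.gauss(-1.0, 0.5) for _ in range(200)]
sample += [rng.gauss(1.0, 0.5) for _ in range(200)]

step = 0.05
grid = [-4.0 + step * k for k in range(161)]  # evaluation points from -4 to 4
density = gaussian_kde(sample, grid)
```

Plotting `density` against `grid` for each outcome variable on the same axes gives a comparison of the kind described for Figure 1; a smaller bandwidth makes a second mode easier to see but also makes the curve noisier.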

Figure 1: Distribution of estimated theta from MLIRT against the mean of plausible values and the sum score

Figure 1 clearly shows that the estimated density functions of the mean of plausible values (solid line) and the sum scores (dotted line) are close to each other, whereas the estimated density function of the MLIRT thetas (dashed line) is different and bimodal. This difference can affect the multilevel parameter estimates. For further investigation, a dummy variable was generated from the theta distribution, in which values smaller than 0 were coded as 1 and values equal to or greater than 0 were coded as 2. With this student-level explanatory variable, the intraclass correlation in the MLIRT analysis increased to 12.8%, suggesting that there is another clustering variable that is not taken into account in the MLIRT analysis. One possible clustering factor is multidimensionality of the test, because MLIRT assumes that the data are unidimensional. The low discrimination indices obtained from MLIRT (i.e. mean=0.66, variance=0.40, range = ) also suggest more than one dimension. A dimensionality analysis using the mirt package for R (Chalmers, 2012) revealed that a two-factor model fits the data better (Table 4). The Tucker-Lewis index (TLI) and the root mean square error of approximation (RMSEA) suggest that the two-factor model fits better and that, compared with the three-factor model, it is more parsimonious.

Table 4: Dimensionality analysis of the data
Columns: One factor | Two factor | Three factor
Rows: Log-likelihood, AIC, BIC, TLI, RMSEA

Another possible source of this clustering factor is the complexity of the data and the sampling design used in TIMSS. Fox (2007), using PISA data, found close estimates from the two methods. A kernel density plot of the estimated thetas from Fox's study shows fairly unidimensional, normally distributed thetas whose distribution is close to that of the mean of the PVs (Figure 2).

Figure 2: Distribution of estimated theta from MLIRT in the Fox (2007) study

It seems that the main difference between the results of the current study and those of Fox (2007) in comparing HLM and MLIRT is due to the difference in the latent abilities estimated by the mlirt package. To sum up, even though the MLIRT approach seems promising for the analysis of data from large-scale assessments, the accuracy of its estimates depends heavily on meeting the assumptions of the analytical method. For the TIMSS Advanced 2008 data used in this study, it seems that, for some reason, plausible values could capture the underlying grouping variable in the data while MLIRT could not. A researcher who wants to use MLIRT therefore needs to consider the possible existence of such factors in the data and include them in the multilevel model, which is not easy. On the other hand, the similarity between the distributions of theta and the mean of PVs in the Fox (2007) study indicates that under some circumstances there is no need to generate plausible values, and the estimated theta values can provide credible results (assuming that plausible values can provide credible results). Nevertheless, the results suggest an obvious need for further investigation of the merits of using MLIRT or HLM for analyzing data from large-scale assessments.

References

Arora, A., Foy, P., Martin, M. O., & Mullis, I. V. S. (2009). TIMSS Advanced 2008 technical report. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6).
Fox, J.-P. (2004). Applications of multilevel IRT modeling. School Effectiveness and School Improvement, 3-4.
Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66.
Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68.
Fox, J.-P. (2007). Multilevel IRT modeling in practice with the package mlirt. University of California at Los Angeles, Department of Statistics.
Fox, J.-P. (2010). mlirt: Multilevel item response theory (mlirt) modeling. R package.
Goldstein, H. (1995). Multilevel statistical models (2nd ed.). London: Edward Arnold.
Kamata, A., & Vaughn, B. K. (2011). Multilevel IRT modeling. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis. Psychology Press.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56(2).
Mislevy, R. J. (1993). Should multiple imputations be treated as multiple indicators? Psychometrika, 58(1).
R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks: Sage Publications.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. T., Jr. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., & Congdon, R. T., Jr. (2004). HLM 6: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Silverman, B. W. (1986). Density estimation. London, England: Chapman and Hall.
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Los Angeles: Sage.
von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful? IERI Monograph Series, 2, 9-3
