Does factor indeterminacy matter in multi-dimensional item response theory?

Similar documents
Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

Comprehensive Statistical Analysis of a Mathematics Placement Test

Confirmatory Factor Analysis of Preschool Child Behavior Checklist (CBCL) (1.5 5 yrs.) among Canadian children

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

A simple guide to IRT and Rasch 2

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Item Analysis: Classical and Beyond

Multidimensional Modeling of Learning Progression-based Vertical Scales 1

Development, Standardization and Application of

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

Turning Output of Item Response Theory Data Analysis into Graphs with R

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Confirmatory Factor Analysis. Professor Patrick Sturgis

Psychometrics in context: Test Construction with IRT. Professor John Rust University of Cambridge

Building Evaluation Scales for NLP using Item Response Theory

Technical Specifications

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Examining the Psychometric Properties of The McQuaig Occupational Test

Diagnostic Classification Models

Introduction to Item Response Theory

Techniques for Explaining Item Response Theory to Stakeholder

2 Types of psychological tests and their validity, precision and standards

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

Model fit and robustness? - A critical look at the foundation of the PISA project

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

Factor Analysis. MERMAID Series 12/11. Galen E. Switzer, PhD Rachel Hess, MD, MS

André Cyr and Alexander Davies

Structural Validation of the 3 X 2 Achievement Goal Model

Reanalysis of the 1980 AFQT Data from the NLSY79 1

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi

Reveal Relationships in Categorical Data

A Comparison of Traditional and IRT based Item Quality Criteria

Development and Psychometric Properties of the Relational Mobility Scale for the Indonesian Population

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Associate Prof. Dr Anne Yee. Dr Mahmoud Danaee

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

The Effect of Guessing on Assessing Dimensionality in Multiple-Choice Tests: A Monte Carlo Study with Application. Chien-Chi Yeh

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

Oak Meadow Autonomy Survey

Nonparametric IRT analysis of Quality-of-Life Scales and its application to the World Health Organization Quality-of-Life Scale (WHOQOL-Bref)

Latent Variable Modeling - PUBH Latent variable measurement models and path analysis

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu

Psychometric properties of the PsychoSomatic Problems scale an examination using the Rasch model

Chong Ho. flies through your eyes

linking in educational measurement: Taking differential motivation into account 1

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Having your cake and eating it too: multiple dimensions and a composite

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Item Response Theory. Robert J. Harvey. Virginia Polytechnic Institute & State University. Allen L. Hammer. Consulting Psychologists Press, Inc.

Using the Score-based Testlet Method to Handle Local Item Dependence

Running head: CPPS REVIEW 1

Scaling TOWES and Linking to IALS

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals


ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Re-Examining the Role of Individual Differences in Educational Assessment

Rasch Versus Birnbaum: New Arguments in an Old Debate

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

was also my mentor, teacher, colleague, and friend. It is tempting to review John Horn s main contributions to the field of intelligence by

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology

Information Structure for Geometric Analogies: A Test Theory Approach

Structural Equation Modeling (SEM)

An item response theory analysis of Wong and Law emotional intelligence scale

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

Pitfalls in Linear Regression Analysis

BACKGROUND + GENERAL COMMENTS

On Test Scores (Part 2) How to Properly Use Test Scores in Secondary Analyses. Structural Equation Modeling Lecture #12 April 29, 2015

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

I. DIMENSIONS OF TEMPERAMENT SCALE

Chapter 2. Test Development Procedures: Psychometric Study and Analytical Plan

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

APPLYING THE RASCH MODEL TO PSYCHO-SOCIAL MEASUREMENT A PRACTICAL APPROACH

A PRELIMINARY EXAMINATION OF THE ICAR PROGRESSIVE MATRICES TEST OF INTELLIGENCE

Survey Sampling Weights and Item Response Parameter Estimation

Demographic Factors in Multiple Intelligence of Pre-Service Physical Science Teachers

Selection of Linking Items

During the past century, mathematics

Impact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing

Improved Transparency in Key Operational Decisions in Real World Evidence

Published by European Centre for Research Training and Development UK (

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Factor analysis of alcohol abuse and dependence symptom items in the 1988 National Health Interview survey

validscale: A Stata module to validate subjective measurement scales using Classical Test Theory

By Hui Bian Office for Faculty Excellence

The Modification of Dichotomous and Polytomous Item Response Theory to Structural Equation Modeling Analysis

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Analysis and Interpretation of Data Part 1


Paper 957-2017

Does factor indeterminacy matter in multi-dimensional item response theory?

Chong Ho Yu, Ph.D., Azusa Pacific University

ABSTRACT

This paper aims to illustrate proper applications of multi-dimensional item response theory (MIRT), which is available in SAS's PROC IRT. MIRT combines item response theory (IRT) modeling and factor analysis when the instrument carries two or more latent traits. While it may seem convenient to accomplish two tasks simultaneously by employing one procedure, users should be cautious of misinterpretations. This illustration utilizes the 2012 Programme for International Student Assessment (PISA) data set collected by the Organisation for Economic Co-operation and Development. Because there are two known sub-domains in the PISA test (reading and math), PROC IRT was programmed to adopt a two-factor solution. Additionally, the loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP were utilized to examine the psychometric properties of the PISA test. When reading and math items were analyzed in SAS's MIRT, 7 to 13 latent factors were suggested. At first glance these results are puzzling, because ideally all items should load onto two factors. However, when the psychometric attributes yielded from a two-parameter IRT analysis are examined, it is evident that both the reading and math test items are well-written. It is concluded that even if factor indeterminacy is present, it is advisable to evaluate the psychometric soundness of a test based on IRT, because content validity can supersede construct validity.

INTRODUCTION

Item response theory (IRT) is a psychometric tool that can amend shortcomings of classical test theory (CTT). Specifically, in CTT, estimations of item difficulty and person trait are sample-dependent; in IRT, however, estimations of item parameters and person theta are more precise (Embretson & Reise, 2000). IRT and its close cousin, Rasch modeling, assume uni-dimensionality (Yu, 2013). In other words, a test or a survey used with these approaches should examine only a single latent trait of the participants. In reality, many tests or surveys are multi-dimensional; to address this issue, multi-dimensional IRT (MIRT) was introduced (Hartig & Höhler, 2009).

To a certain extent, MIRT is a fusion of factor analysis and IRT. Although factor analysis and IRT share common ground in that some parameterizations of model parameters in factor analysis can be transformed into parameters in item response theory (Kamata & Bauer, 2008), the underlying philosophy of IRT is vastly different from that of factor analysis, which belongs to the realm of CTT. For example, the psychometric attributes of an instrument yielded from factor analysis attach to the entire scale. If one selects items from a validated scale, then the original psychometric properties will be damaged. In contrast, each item developed by IRT has its own characteristics (item difficulty parameter, discrimination parameter, guessing parameter, item information function, etc.). Hence, it is legitimate to generate an adaptive test by selecting items from an item bank.
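For reference, these item-level quantities can be written out explicitly. Under the two-parameter logistic (2PL) model used later in this paper, the probability of a correct response, the item information function, and the test information function take the standard forms below, where a_i is the discrimination and b_i the difficulty of item i:

% Probability that an examinee with ability theta answers item i correctly (2PL)
P_i(\theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}

% Information contributed by item i at ability level theta
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)

% Test information function (TIF): the sum over all items in the test
I(\theta) = \sum_{i=1}^{n} I_i(\theta)

Because each a_i and b_i belongs to a specific item rather than to the whole scale, items drawn from a calibrated bank keep their parameters, which is what makes the adaptive testing described above possible.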
In addition, when responses are dichotomous (e.g., right or wrong) instead of ordinal (e.g., Likert-scale ratings), conventional factor analysis utilizing Pearson's correlation matrix is invalid. Nevertheless, throughout the last decade, numerous algorithms have been developed to make this fusion successful (Han & Paek, 2014). For example, in MIRT software packages the tetrachoric correlation matrix has replaced Pearson's correlation matrix. At first glance it is efficient to accomplish two tasks concurrently (identifying the factor structure and the item characteristics). However, it is important to point out that sometimes IRT and factor analysis results do not concur with each other. Specifically, it is common that while IRT yields excellent psychometric properties of the test items, factor analysis fails to yield a sound solution.
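As a concrete point of reference, the tetrachoric correlation for a single pair of dichotomous items can be inspected in SAS with the PLCORR option of PROC FREQ; the data set and item names below are hypothetical placeholders:

/* Polychoric correlation for one pair of scored (0/1) items;  */
/* for a 2x2 table this is the tetrachoric correlation.        */
proc freq data=pisa_usa;
   tables math_item1*math_item2 / plcorr;
run;

MIRT procedures work from the full matrix of such correlations; the scree plots shown in the Results section are based on the eigenvalues of the polychoric matrix reported by PROC IRT.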

METHODOLOGY

This illustration utilizes the 2012 Programme for International Student Assessment (PISA) data set collected by the Organisation for Economic Co-operation and Development (OECD, 2013). Every three years OECD delivers assessments in three subject matters (reading, math, and science) to its member and partner countries. In 2012, 51,000 students from 65 countries participated in PISA. In order to rule out cultural difference as an extraneous variable, only USA students are selected in this analysis (n = 4978). However, not all students answered all test items; rather, different booklets were assigned to different students, and the missing values were imputed to obtain plausible values for each student. To simplify the illustration, a subset of US students and a subset of items are selected so that imputation is not necessary. As a result, only 411 students are retained. The original PISA exam contained 13 booklets and 206 items across the three domains. One may argue that math and science require similar reasoning approaches, and thus it is expected that these two factors would be indistinguishable in the data. To illustrate factor indeterminacy in this large-scale assessment, the authors retain reading and math items only. According to Gardner (2006), linguistic skill and math skill belong to different dimensions of intelligence. Because there are two known sub-domains in the test (reading and math), PROC IRT in SAS was forced to adopt a two-factor solution, as sketched below. The loading plot, dual plot, item difficulty/discrimination plot, and test information function plot in JMP were utilized to examine the psychometric properties of the PISA test. The dual plot is a graphic display that places item parameters and student ability (theta) on the same scale, whereas the Test Information Function (TIF) is the sum of all item information functions in the test.
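A minimal sketch of this exploratory two-factor run might look as follows; the data set and item variable names (and item counts) are placeholders, and the exact option names should be verified against the PROC IRT documentation for the SAS/STAT release in use:

/* Exploratory two-factor MIRT with two-parameter (2PL) response  */
/* functions for the retained US subsample; math1-math20 and      */
/* read1-read20 stand in for the binary math and reading items.   */
ods graphics on;
proc irt data=pisa_usa nfactor=2 resfunc=twop plots=scree;
   var math1-math20 read1-read20;
run;

Forcing a two-factor solution fits the two-dimensional model, but the scree plot and variance-explained panels are still computed from all eigenvalues of the polychoric correlation matrix, which is why they can point to far more than two factors.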

RESULT

Figure 1 depicts the scree plot and the variance explained. Based on this information, as well as on the eigenvalues of the polychoric matrix of math and reading items, it seems that there are 13 underlying factors in the PISA test. Figure 2 displays the loading plot of all item vectors in a two-factor solution. Obviously, all vectors are jammed together and no clustering pattern can be detected.

Figure 1. Scree plot and variance explained from PROC IRT for PISA math and reading items.

Figure 2. Loading plot of PISA items.

As mentioned before, one may argue that there might be two latent traits (reading and math skills) that contribute to performance even on a math test alone. When the math items alone are analyzed in SAS's MIRT, seven latent factors are suggested (see Figure 3).

Figure 3. Scree plot and variance explained from PROC IRT for PISA math items.

At first glance these results are puzzling or even discouraging. However, when the dual plot (Figure 4) yielded from a two-parameter IRT analysis is examined, it is evident that the math test items are well-written. First, the item difficulty levels are evenly distributed; there is no extremely difficult or extremely easy item. The student ability distribution also forms a fairly normal curve. More importantly, when the item and student attributes are compared, it is obvious that the test difficulty matches the student ability. If the best students could outsmart the test, there would be no corresponding items on the left side of the dual plot; conversely, if the test were too challenging, there would be no corresponding students on the right side of the plot. But this dual plot shows that every item has a matching student and every student has a matching item. The item difficulty/discrimination plot (Figure 5) provides additional support for the psychometric soundness of the test by showing that all items have positive discrimination parameters.

Further, the test information function plot (Figure 6) shows that the exam yields a great deal of information about examinees whose ability estimates concentrate around the center (zero). The peak of the TIF is above zero, indicating that more information can be obtained about students with slightly above-average ability estimates. Math is considered a challenging subject, and this psychometric information is an important contribution to the field of mathematics education.

Figure 4. Dual plot of PISA math items.

Figure 5. Item difficulty/discrimination plot of PISA math items and USA student ability estimates.

Similar findings are observed in the IRT modeling of the PISA reading items. Figures 7 to 9 show that the reading items also have fine psychometric properties. Specifically, both the item difficulty levels and the student abilities are normally distributed. The items are well balanced in that students cannot outsmart the test, but the test is also not overly challenging. Further, all discrimination parameters are positive and the test information function concentrates around zero. The peak of the TIF is slightly below zero, meaning that more information can be obtained about students whose reading skill is mildly below average. Given all these IRT psychometric attributes, it is absurd to deny the validity of the PISA exam just because there is no clear factor solution. Further, when the math and reading tests are separately evaluated in two IRT models, each model returns a single ability estimate per student. But if a MIRT model is used, it yields individual ability profiles as test results rather than single scores (Hartig & Höhler, 2009), as illustrated in the sketch below. The question is: could we obtain more useful information by adding this extra layer of complexity?
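For illustration, a confirmatory two-factor run that assigns each item to its intended dimension and saves per-dimension ability estimates might be sketched as follows; the factor, item, and data set names are hypothetical, and the statements should be checked against the PROC IRT documentation:

/* Confirmatory two-factor MIRT: each item loads only on its       */
/* intended dimension; EAP factor scores (one theta per dimension) */
/* are written to the output data set as an ability profile.       */
proc irt data=pisa_usa resfunc=twop scoremethod=eap out=ability_profiles;
   var math1-math20 read1-read20;
   factor MathSkill    -> math1-math20,
          ReadingSkill -> read1-read20;
run;

Each record of the hypothetical ability_profiles data set would then carry two theta estimates, one for MathSkill and one for ReadingSkill, rather than a single overall score.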

Figure 6. Test Information Function plot of PISA math items.

Figure 7. Dual plot of PISA reading items.

Figure 8. Item difficulty/discrimination plot of PISA reading items and USA student ability estimates.

Figure 9. Test Information Function plot of PISA reading items.

CONCLUSION

In the past, CTT and IRT were believed to be incompatible. As such, merging these methodologies seemed to be a remote dream. Nevertheless, with the advance of MIRT the dream became a reality. Indeed, combining these methodologies can remediate limitations in both camps. While traditional IRT allows for measurement of a single construct only, infusing factor modeling enhances IRT with multidimensionality. In CTT the psychometric properties are tied to the entire instrument; consequently, when a researcher wants to customize a scale for a specific population, he or she needs to repeat the tedious process of EFA and CFA. IRT, in contrast, provides users with the flexibility of generating an ad hoc, on-the-fly test. Despite the versatility of MIRT, it is not uncommon that PROC IRT cannot yield a sound IRT model and a sound factor model at the same time. When factor indeterminacy is present, it is advisable to evaluate the psychometric soundness of the test based on IRT alone, because content validity can supersede construct validity.

REFERENCES

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. New York, NY: Psychology Press.

Gardner, H. (2006). Multiple intelligences: New horizons in theory and practice. New York, NY: Basic Books.

Han, K. T., & Paek, I. (2014). A review of commercial software packages for multidimensional IRT modeling. Applied Psychological Measurement, 38, 486-498.

Hartig, J., & Höhler, J. (2009). Multidimensional IRT models for the assessment of competencies. Studies in Educational Evaluation, 35, 57-63.

Kamata, A., & Bauer, D. J. (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling, 15, 136-153.

Organisation for Economic Co-operation and Development [OECD] (2013). The PISA international database. Retrieved from http://pisa2012.acer.edu.au/

Yu, C. H. (2013). A simple guide to the item response theory (IRT) and Rasch modeling. Retrieved from http://www.creative-wisdom.com/computer/sas/irt.pdf

ACKNOWLEDGMENTS

Special thanks to Ms. Anna Lee and Ms. Samantha Douglas for their valuable input to this article.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Chong Ho Yu, Ph.D., D. Phil.
Azusa Pacific University
cyu@apu.edu
http://www.creative-wisdom.com/pub/pub.html

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.