Models in Educational Measurement

Jan-Eric Gustafsson, Department of Education and Special Education, University of Gothenburg

Background. Measurement in education and psychology has increasingly come to rely on explicitly formulated statistical models. Statistical models offer power, precision, and flexibility by focusing on a few essential aspects of the phenomenon under study. But if the model assumptions do not agree with the phenomenon, inferences may be incorrect. Therefore, issues of model fit are centrally important. However, it is not always easy to determine whether a model fits the data, or what the consequences of model misfit are.

The Rasch model. According to the Rasch model, the probability of a correct response to a test item is a function of the ability of the person and of the difficulty of the item: as the ability of the person increases, the probability of a correct answer increases, and as the difficulty of the item increases, the probability of a correct answer decreases. Under certain assumptions, it is possible to estimate the difficulty of items independently of the ability of persons, and to estimate the ability of persons independently of the difficulty of the items.
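
In its standard dichotomous form, the model specifies the probability that person v answers item i correctly as a logistic function of the difference between the person's ability θ_v and the item's difficulty b_i:

\[ P(X_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)} \]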

The Rasch model, cont. Once the difficulty parameters of a set of items have been estimated, the items can be used to construct different tests which all measure the same ability, and different test-takers can be given different items to measure one and the same ability. This offers great advantages for solving practical measurement problems, such as adaptive testing, horizontal and vertical linking of tests, and matrix-sampling test designs.
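
As a minimal sketch of what this buys in practice (the item bank and function name are hypothetical, not code from the talk), the following Python fragment estimates a person's ability by maximum likelihood from any subset of items whose difficulties have already been calibrated:

```python
import numpy as np

def estimate_ability(responses, difficulties, n_iter=20):
    """Newton-Raphson maximum-likelihood ability estimate under the
    Rasch model, with item difficulties fixed from a prior calibration.
    `responses` is a 0/1 vector; `difficulties` the matching b-values.
    (Perfect and zero scores have no finite ML estimate.)"""
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
        grad = np.sum(responses - p)   # score function of the log-likelihood
        info = np.sum(p * (1.0 - p))   # test information at theta
        theta += grad / info
    return theta

# Two test-takers answer *different* subsets of one calibrated item bank:
bank = np.array([-1.5, -0.8, -0.2, 0.3, 0.9, 1.6])
theta_a = estimate_ability(np.array([1, 1, 0, 1]), bank[:4])
theta_b = estimate_ability(np.array([1, 0, 1, 0]), bank[2:])
```

Because the difficulties are fixed from the calibration, the two estimates land on the same scale even though the item subsets differ, which is exactly what adaptive testing and test linking exploit.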

Does the model fit the data? The Rasch model is attractive because, if it fits the data, the properties of the model guarantee simple and powerful solutions to practical measurement problems. Its assumptions are unidimensionality and homogeneous discrimination of all items. These assumptions may be tested with statistical tests constructed within the framework of the Rasch model.

Gustafsson, J.-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33(2), 205-233. Two categories of statistical tests: ICC-tests investigate whether item parameters are invariant across subsets of persons (e.g., high scorers vs. low scorers, boys vs. girls), and PCC-tests investigate whether ability parameters are invariant across subsets of items. Some results: ICC-tests do not detect multidimensionality. Muthén (1978) analyzed 15 locus-of-control items with a new method for factor analysis of dichotomous variables and found three weakly correlated factors; according to an ICC-test, however, the Rasch model fitted these data well, while the PCC-test supported the conclusion that there were three separate dimensions. The statistical power of the ICC-tests is strongly dependent on sample size and on the heterogeneity of the sample. The Rasch model does not fit speeded tests or tests which allow guessing.
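
To make the logic of an ICC-test concrete, here is a rough Python illustration. It is a simplification: centered item log-odds stand in for proper difficulty estimates, where a real analysis would use conditional maximum likelihood:

```python
import numpy as np

def crude_difficulties(X):
    """Centered item log-odds as rough Rasch difficulty estimates.
    X is a persons-by-items 0/1 response matrix."""
    p = X.mean(axis=0).clip(0.01, 0.99)
    b = np.log((1.0 - p) / p)      # harder item -> larger b
    return b - b.mean()            # fix the origin of the scale

def icc_style_check(X):
    """Compare item difficulty estimates between low and high scorers,
    in the spirit of an ICC-test of parameter invariance."""
    total = X.sum(axis=1)
    low = X[total <= np.median(total)]
    high = X[total > np.median(total)]
    return crude_difficulties(high) - crude_difficulties(low)
```

Under the Rasch model the per-group difficulty estimates should agree, so the returned differences stay near zero. An item with unusually high discrimination looks too easy among high scorers and too hard among low scorers, so its entry in the difference vector is markedly negative.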

Gustafsson, J.-E., & Lindblad, T. (1978). The Rasch model for dichotomous items: A solution of the conditional estimation problem for long tests and some thoughts about item screening procedures. Paper presented at the European Conference on Psychometrics and Mathematical Psychology, Uppsala, June 15-17, 1978. The Rasch model was used to analyze a test of English grammar for Swedish students. The model had poor fit, primarily because a set of items measuring knowledge of irregular verbs discriminated too strongly. In separate analyses, good fit was found for the irregular-verb items, as well as for the other items, after some poorly constructed items had been excluded.

What to do when model fit is poor? Several options were considered:
- Exclude the offending items. This would have caused unacceptable construct underrepresentation. It would also have been illogical, because too high or too low discrimination typically is not an intrinsic characteristic of an item but depends on whether the other items have similar discrimination.
- Put the problematic items in a separate scale. This would have been impractical, unless we aimed to differentiate between different domains of English grammar; to do that reliably, we would need more items testing irregular verbs.
- Turn to another, less restrictive model, such as Verhelst's OPLM, a Rasch-type model which allows different but fixed discrimination parameters (sketched below). This model had not yet been developed at the time.
- Keep the items in the test and accept the poor fit. This could imply loss of credibility.
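
For reference, the OPLM keeps the Rasch form but multiplies the logit by an item-specific discrimination index a_i that is fixed in advance rather than estimated, which preserves the possibility of conditional maximum likelihood estimation:

\[ P(X_{vi} = 1 \mid \theta_v) = \frac{\exp\big(a_i(\theta_v - b_i)\big)}{1 + \exp\big(a_i(\theta_v - b_i)\big)} \]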

Robustness. George Box: "Essentially, all models are wrong, but some are useful." Many applications of the Rasch model and other IRT models, such as TIMSS and PISA, define both an overall score and subscores for different domains or processes. This must be a violation of the unidimensionality assumption; still, the practice seems meaningful and useful. Coefficient α is often described as not being based on any strict assumptions, but the formula is in fact based on the same assumptions as the Rasch model: unidimensionality and homogeneous item discrimination. If the assumptions are violated, α underestimates reliability. However, even in the presence of large variation in item discrimination, the underestimation is marginal (e.g., Reuterberg & Gustafsson, 1992).
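
For a test score X formed as the sum of k item scores Y_1, ..., Y_k, the formula is

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{Y_i}}{\sigma^2_X}\right) \]

and it equals the reliability only under the unidimensional, equal-discrimination conditions just described; when these are violated it is a lower bound, which is the underestimation referred to above.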

Some conclusions. It is difficult to assess the fit of the Rasch model, and it is even more difficult to develop well-fitting models. There is a risk of conflict between the model requirements and the validity of the test. Use of the model needs to rely on trust in robustness. The capability of the Rasch model to deal with issues of multidimensionality is limited.

Dimensionality of cognitive abilities. Factor analysis was invented to investigate the dimensionality of variables. Spearman invented factor analysis to test the hypothesis that individual differences in cognition can be captured by a g-factor. Thurstone invented exploratory factor analysis and demonstrated that there are seven primary mental abilities; followers of Thurstone extended this number to at least 100 primary abilities. Cattell applied factor analysis to correlations among factors to identify second- and third-order factors, and introduced the distinction between Fluid Intelligence (Gf) and Crystallized Intelligence (Gc). Jöreskog developed confirmatory factor analysis and structural equation modeling, allowing flexible and powerful building and testing of latent variable models.

Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8(3), 179-203.

Does the model hold? The g = Gf relation was replicated in several studies, but far from all. Carroll's (1993) meta-analysis did not replicate the perfect relation, but it showed that Gf was the broad ability most highly related to g.

Valentin Kvist, A., & Gustafsson, J.-E. (2008). The relation between fluid intelligence and the general factor as a function of cultural background: A test of Cattell's investment theory. Intelligence, 36(5), 422-436. The correlation between g and Gf was .83 in a heterogeneous group of adults, but correlations of .98-.99 were found within each of the three subgroups: non-immigrants, European immigrants, and non-European immigrants. Explanation: Gf is a determinant of learning in all domains, along with motivation, effort, and opportunity to learn. When everyone has equal opportunity to learn, Gf influences learning and development in all domains, and so it becomes a general factor; if subgroups of persons have had different opportunities to learn certain domains, the generality of Gf breaks down. These results provide general support for Cattell's investment theory.

The Investment theory. The Investment theory basically says that Gf is a causal factor in the development of individual differences in learning. If we knew the mechanisms through which Gf influences the development of fundamental skills such as decoding and vocabulary, we would have a better basis for educational interventions.

Methodological aspects of the hierarchical model. The standard view of measurement implies that the phenomenon can be described in terms of a set of correlated dimensions which all are unidimensional. However, some constructs are broad and encompass a very wide range of phenomena (e.g., g), others encompass wide domains of phenomena (e.g., Gc), while still others are narrow and encompass a more limited range of phenomena (e.g., knowledge of irregular verbs). The constructs differ in referent generality. When the unidimensionality requirement is imposed, the focus falls on constructs with narrow referent generality. Typically, the consequence has been that broad constructs are splintered into narrower and narrower constructs, as happened in the research on cognitive abilities.

Gustafsson, J.-E. (2002). Measurement from a hierarchical point of view. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73-95). London: Lawrence Erlbaum Associates. Three propositions: To measure constructs with high referent generality it is necessary to use heterogeneous measurement devices. A homogeneous test always measures several dimensions. To measure constructs with low referent generality it is also necessary to measure constructs with high generality.

Measurement from a hierarchical point of view, cont. The principle of aggregation: aggregation causes the general factor to account for a larger proportion of variance in the sum of scores than it does in each observed measure. Each observed variable is complex, but aggregate scores may be essentially unidimensional. Aggregation over broad domains of performance is thus a way to approximate unidimensionality, so that robust use of the Rasch model and other IRT models may still be possible.
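
A back-of-the-envelope version of the aggregation argument, assuming k measures X_i = λg + e_i that each load λ on a general factor g with unit variance and have independent unique parts with variance σ²_e: for the sum S = Σ X_i,

\[ \operatorname{Var}(S) = k^2\lambda^2 + k\sigma^2_e, \qquad \frac{k^2\lambda^2}{k^2\lambda^2 + k\sigma^2_e} = \frac{k\lambda^2}{k\lambda^2 + \sigma^2_e} \to 1 \ \text{as } k \to \infty. \]

The common part cumulates while the unique parts average out, so the general factor's share of the aggregate grows with the number of measures even when its share of each single measure is modest.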

Grading in Sweden. In Sweden, grades have always been high-stakes, because they have been the primary instrument for eligibility and selection to the next level of the educational system, and teachers have always been trusted to grade their students. Exams were abolished in the 1960s, and standardized testing has traditionally had a comparatively limited role. Up until the mid-1990s the grading system was norm-referenced, but in 1998 a criterion-referenced grading system was introduced.

The norm-referenced grading system. The norm-referenced grading system was developed in the 1940s by Frits Wigforss (SOU 1942:11), after it had been observed that the grades used for admission to grammar school ("realskola") severely lacked comparability across schools and teachers. The proposed system specified that grades should be normally distributed in the population, with a specified percentage of the students at each step of the grading scale. So-called standard tests were developed to guide the teachers' grading at the class level. With the introduction of the comprehensive school ("grundskola") in 1962, a five-step grading scale (1-5) was introduced, without any pass level.

Critique of the norm-referenced grading system. The norm-referenced grading system was criticized on many grounds: it inspired competition rather than cooperation; it was unfair to students in different classes ("There are no 5s left"); and because the grade distribution was specified to be Normal(3, 1) in the population, the grades could not be used to describe change in levels of knowledge and skills. Along with a curriculum reform in 1994, the norm-referenced grades were abolished and a criterion-referenced system was introduced.

The criterion-referenced grading system. The system, first put to use in 1998, had a scale with four steps: Pass with Special Distinction (MVG), Pass with Distinction (VG), Pass (G), and Fail (IG). In 2011 a new scale with six steps (A-F, where F = fail) was introduced. According to the original plans, the percentage of failed students was expected to be a few points, but the first results showed it to be much higher (9%), and it has since increased to 14%. The grading is guided by verbally formulated knowledge requirements for the different steps of the grading scale.

Knowledge requirements (partial) for grades E/C/A at the end of year 9. Grade E: Pupils can choose and use basically functional mathematical methods with some adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change, with satisfactory results. Grade C: Pupils can choose and use appropriate mathematical methods with relatively good adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change, with good results. Grade A: Pupils can choose and use appropriate and effective mathematical methods with good adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change, with very good results.

Gustafsson, J.-E., Cliffordson, C., & Erickson, G. (2014). Likvärdig kunskapsbedömning i och av den svenska skolan: problem och möjligheter [Equitable knowledge assessment in and of the Swedish school: problems and possibilities]. Stockholm: SNS Förlag. The study found substantial grade inflation, particularly in upper secondary school; considerable variation in grading practices among teachers and schools; and instability in the national tests across years and subjects. These problems, along with several others, seem to be due to the lack of precision in the verbally formulated knowledge requirements for the different steps on the grading scale. Wigforss (1942) concluded that it is not possible to achieve sufficient comparability in grading based on verbally formulated criteria, which was why he developed the norm-referenced grading system.

Olsen, R. V., & Nilsen, T. (in press). Standard setting in PISA and TIMSS. In S. Blömeke & J.-E. Gustafsson (Eds.), Standard setting in education: The Nordic countries in an international perspective. New York: Springer. The authors compare and discuss similarities and differences in the way PISA and TIMSS set and formulate descriptions of standards or do scale anchoring (International Benchmarks in TIMSS, based on a curriculum model; Proficiency Levels in PISA, based on a competence model), with a focus on the empirical basis for the development of performance-level descriptors. Their interest in standard setting and performance-level descriptions was partly driven by the fact that the Norwegian grading system has problems of comparability in grading.

A generic example of an item map (from Olsen & Nilsen, in press): decide on the number and location of cut-scores to be used, then develop Performance-Level Descriptors (PLDs) based on descriptions of the clusters of items identified and on the general description of the construct stated in the framework (a sketch of the item-grouping step follows below). This typically requires many items, given that the PLDs should not be formulated in item-specific terms.
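
As an illustration of the item-grouping step, here is a small Python sketch; the items, difficulty values, and cut-scores are hypothetical, chosen to mimic the TIMSS benchmarks quoted below:

```python
import numpy as np

# Hypothetical calibrated item difficulties on a TIMSS-like scale
difficulties = {
    "whole-number addition": 390,
    "pie-chart proportions": 470,
    "prime factorisation": 560,
    "fraction product on a number line": 650,
}

cuts = [400, 475, 550, 625]   # Low / Intermediate / High / Advanced cut-scores
bands = ["Below Low", "Low", "Intermediate", "High", "Advanced"]

# Assign each item to the band between the cut-scores it falls in;
# the items landing in a band seed the wording of that level's descriptor.
for item, b in sorted(difficulties.items(), key=lambda kv: kv[1]):
    band = bands[np.searchsorted(cuts, b, side="right")]
    print(f"{b:4d}  {band:12}  {item}")
```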

TIMSS International Benchmarks (partial). Low (400): Students have some knowledge of whole numbers and decimals, operations, and basic graphs. Intermediate (475): Students can solve problems involving decimals, fractions, proportions, and percentages in a variety of settings. For example, they can determine proportions of a whole in order to construct pie charts, and calculate unit prices to solve a problem. High (550): Students can use information from several sources to solve problems involving different types of numbers and operations. They can relate fractions, decimals, and percents to each other, and can solve problems with fractions, proportions, and percentages. Students show understanding of whole-number exponents and can identify the prime factorization of a given number. Advanced (625): Students can solve a variety of fraction, proportion, and percent problems and justify their conclusions. They can reason with different types of numbers, including whole numbers, negative numbers, fractions, and percentages, in abstract and non-routine situations. For example, given two points on a number line representing unspecified fractions, students can identify the point that represents their product.

PLDs and grading. Empirically based PLDs could potentially provide a more stable foundation to support classroom-based assessment and criterion-referenced grading than the currently used knowledge requirements, if formulated at an appropriate level of abstraction. Linking national tests to PISA, TIMSS, and the other international studies could provide a broader basis for constructing PLDs. Dimensionality remains an open question.

Individual differences versus development of competence. The PLDs could perhaps be developed into empirically and theoretically based descriptions of learning trajectories, which could inform curricula, instruction, and assessment. The classical measurement models focus on individual differences: the notions of dimensionality, discrimination, reliability, and validity are defined in terms of variance and covariance, and their application requires that the population of persons be defined. These models have limited applicability for the study of individual growth. Yet the main aim of education is to support the development of competence, so in educational measurement issues of development should be a central concern. The tensions between differential and developmental psychology illustrate the difficulty of integrating research on individual differences and development. However, progress has lately been made through growth curve modeling and applications of IRT to measurement problems, and hopefully we will see more integration of the differential and developmental approaches in the future.