INVESTIGATING FIT WITH THE RASCH MODEL

Benjamin Wright and Ronald Mead (1979?)

Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurements are made always involve a multitude of factors. Successful measurement depends on achieving sufficient control over the observations taken so that their variation is dominated by the positions of persons (and items) along a single variable. Even though persons differ in many ways, their measurement becomes possible when one of these dimensions dominates the behavior elicited by the items used for measuring. Analogously, even when items differ on a number of dimensions, they can be used for measuring if the responses of persons are dominated by only one of these dimensions. Thus measurement can succeed in spite of multidimensionality when the multidimensionality is not shared actively by both persons and items. We will illustrate this with some examples.

Case I: Two types of items

Suppose we wish to measure "general mental ability," and to do this, we construct an instrument containing both reading and math items. While this instrument might be considered two-dimensional, measurement with it could succeed in situations where either 1) there is no variable which affects the probability of success on the reading items differently than on the math items, or 2) math ability and reading ability are so highly correlated in the population that they do not appear different.

In either case we should not care whether the measurement was made entirely with reading items, entirely with math items, or any mixture in between, since all items would measure the "same" variable. In the first case, there is only one variable. In the second, there are two, but since they are highly correlated they act as one, and we can measure math ability with reading items and reading ability with math items, if we choose. It does not matter whether we call the resulting measure math, reading, or general ability. However, if we try to assert that both types of items are necessary for a "fair" measurement and become involved in setting the correct proportion of each, we have admitted the multidimensionality of the situation and should instead measure the two variables separately with items appropriate to each. It is only possible to measure a person, who always has many different abilities, on one variable by carefully constructing an instrument which addresses just that one variable. We may sometimes get by with a multidimensional instrument, since the two alternatives above--one variable versus two highly correlated variables--are not distinguishable in data; but when we use an instrument of items readily classifiable into two or more types, its (effective) unidimensionality must be corroborated for each new sample.

Case II: One type of item with extraneous variables

A contrasting case can be illustrated by considering the measurement of, say, problem solving ability with an instrument composed of written word problems. Proficiency on this instrument requires many abilities in addition to problem solving, not the least of which is the ability to read the language in which the problems are written.

If the reading ability of every person is well above the "readability" of the problems, differences among items or persons in this respect will not affect performance on the instrument. However, if any person has difficulty reading the problems, his measure of problem solving ability will be biased downward by this extraneous factor. His probability of success will be influenced by the interaction between his reading ability and the readability of the problem. This can, of course, be eliminated by carefully regulating readability to be beneath the reading ability of the target population. This case differs from the preceding one in that each item has a "difficulty" on two variables. As long as all persons are sufficiently able readers, the instrument can be used to measure problem solving ability. Theoretically, at least, such an instrument could also be used to measure reading ability among very able problem solvers who were poor readers.

Random guessing on multiple-choice items is another example of extraneous variation. Persons succeed on difficult items more often than their abilities predict. This makes them appear more able when more difficult items are administered, since their success rate does not decrease as difficulty increases. A similar but opposite effect occurs when able persons become careless with easy items, making them appear less able than they are. Such items "measure" two variables--the ability of interest and the tendency to guess or to become careless. The "guessingness" of an item may or may not be a simple function of the difficulty on the main variable, but for the person two different variables are involved. The measure of either variable is threatened by the presence of the other.

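A minimal numerical sketch makes the size of this guessing disturbance concrete. (This sketch is an editorial addition, not part of the original paper; the 0.2 guessing floor for a five-option item and the function names are illustrative assumptions.)

```python
import numpy as np

def rasch_p(beta, delta):
    """Rasch probability of success: ability beta, difficulty delta (logits)."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

def observed_p(beta, delta, guess):
    """Success probability when a response the person cannot produce can
    still be guessed correctly with probability `guess` (roughly 0.2 for
    a five-option multiple-choice item)."""
    p = rasch_p(beta, delta)
    return p + (1.0 - p) * guess

# A person 3 logits below an item's difficulty: the Rasch model predicts
# about 0.05, but a 0.2 guessing floor raises the observed success rate
# to about 0.24, so the person looks more able than he is on hard items.
print(rasch_p(-1.0, 2.0), observed_p(-1.0, 2.0, guess=0.2))
```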
These forms of multidimensionality have in common that different subsets of the items produce non-equivalent estimates of person ability, and different subsamples of persons produce different estimates of item difficulty. This contradicts the Rasch requirement that ability measures be independent of the items administered and that item calibrations be independent of the persons used. In order to see how to avoid such disruptions, we need to study possible sources of disturbance and to develop analytic procedures that detect and assess the importance of disruptions when present.

Unequal Item Discriminations

No discussion of disturbances in Rasch measurement is complete without mention of item "discrimination." Rasch's derivation of what is required in order to achieve objectivity (i.e., measures of person ability that are freed from the particular sets of items administered, and calibrations of items that are freed from the particular samples of persons used) leads to a model which rules against a parameter for item discrimination. If measurement objectivity is to be achieved, the situation must be arranged so that a parameter for discrimination is not necessary.
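For reference, in standard notation (not the authors' own), the reason the model rules against a discrimination parameter can be written out directly: the Rasch model makes the log-odds of success additive in a person parameter and an item parameter, and that additivity is exactly what a discrimination parameter destroys.

```latex
% Rasch model: the log-odds of success are additive in the parameters
\log \frac{P\{x_{vi}=1\}}{P\{x_{vi}=0\}} = \beta_v - \delta_i
% so the comparison of persons v and w on any item i,
%   (\beta_v - \delta_i) - (\beta_w - \delta_i) = \beta_v - \beta_w,
% is free of the item parameter. Under the Birnbaum model the log-odds
% are a_i(\theta_v - b_i), so the same comparison, a_i(\theta_v - \theta_w),
% depends on which item is used: the objectivity of the comparison is lost.
```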

When the problem is approached from other perspectives, for example, when the observations are considered so inviolate and valuable that the data are allowed to determine the form of the model, regardless of the effect on the measurement process, item discrimination is almost always included as a parameter. A model with an additional parameter, such as discrimination, will always recover the observed data more precisely than one without, but it is not at all clear, when that is done, what status the resulting "estimates" of discrimination can have in our thinking about the generalizable and reproducible attributes of the situation. It remains to be settled whether discrimination "estimates" pertain to a stable, meaningful parameter that characterizes future outcomes of similar situations or whether they are only transiently useful as a descriptive statistic for diagnosing the trouble in one set of observations.

There are two distinctly different situations which would lead to unequal discriminations. First, the items may be influenced to different extents by factors other than the variables of interest. For example, with the problem solving test, if the items vary in readability and their readability is near enough to the reading level of the persons that some persons have reading difficulty with some of the items, then the items will appear to vary in their power to discriminate along the problem solving scale. Were the second variable controlled, this apparent variation in discrimination would disappear. Items which no one is able to read will have no relationship to problem solving ability, and items which everyone reads without difficulty will have the strongest relationship. Hence, the highest "discriminations" will be associated with items that are influenced only by the variable of interest, and the lowest will be for items most influenced by other factors.

Alternatively, discriminations could vary if the items differ in the amount of random fluctuation associated with them. It should not be surprising that a completion item, which requires the person to recall the correct response, discriminates more sharply than a multiple-choice item, which requires the person merely to recognize the correct answer. Recognition items give the person who does not recall, or even recognize, the correct response the opportunity to eliminate responses he knows to be incorrect, thereby increasing his probability of choosing the correct one. If his success at this is related to his position on the latent variable, and not to his test-wiseness or any other extraneous factor, intelligent guessing of this sort need not interfere with the measurement process, but it does suggest that the two types of items are of different quality. More generally, items which require dramatically different behaviors from the person could reasonably be expected to vary in their discriminations. It would be rather surprising if "application" items related to their variable with the same precision as "knowledge" items. This suggests that it should be possible to distinguish items of unequal quality, if the inequality is due to the precision inherent in an item type, thus avoiding the necessity of parameterizing discrimination in the measurement model.[1]

[1] This assumes that items influenced by problem solving ability are numerous enough to dominate the operational definition of the variable measured by the instrument. Otherwise the variable would be a mixture of reading and problem solving, and both types of items might discriminate poorly.

If the inequality is due to the subtle presence of extraneous variables, the effect is not that of an item parameter but rather a transitory attribute of the local situation. This source of multidimensionality is a problem for any latent trait model and will always require a technique for discovering and controlling it.

The requirements of the Rasch model are little different from the recognized rules of good test construction. They are more explicit and more defensible because of their relationship to a clear philosophy of measurement. We will exploit the unique properties of the Rasch model to discover and evaluate disruptions in the measurement process. Fitting the model, and analyzing the observed residuals from it in the difficulty metric, allows us to specify many useful questions in the familiar form of linear models. Since the Rasch model is the simplest possible latent trait model, this analysis gives the data the greatest possible opportunity to reveal important occasions of multidimensionality.

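A minimal sketch of such a residual analysis, assuming a NumPy implementation and illustrative function names (this is an editorial illustration, not the authors' procedure):

```python
import numpy as np

def rasch_prob(beta, delta):
    """Probability of success under the Rasch model for person
    abilities beta (length V) and item difficulties delta (length I),
    both in logits; returns a V-by-I matrix."""
    return 1.0 / (1.0 + np.exp(-(beta[:, None] - delta[None, :])))

def standardized_residuals(x, beta, delta):
    """z_vi = (x_vi - p_vi) / sqrt(p_vi * (1 - p_vi)) for a V-by-I
    matrix x of scored (0/1) responses. Large |z| flags person-item
    encounters the model did not expect, such as a lucky success on
    a hard item or a careless failure on an easy one."""
    p = rasch_prob(beta, delta)
    return (x - p) / np.sqrt(p * (1.0 - p))

def item_fit_mean_square(z):
    """Unweighted (outfit-style) mean square per item: the mean of the
    squared standardized residuals over persons. Values near 1 indicate
    fit; values well above 1 indicate unmodeled noise such as guessing."""
    return (z ** 2).mean(axis=0)
```

These standardized residuals can then be regressed on groupings of persons or items, which is the "familiar form of linear models" referred to above.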
Motivation for the Use of Rasch Fit Analysis

Rasch (1960) and Wright (1967) have persuasively argued the advantages of specific objectivity in measurement over alternative approaches. With it, it is possible to estimate item parameters that are independent of the ability level of the calibrating sample and to obtain valid comparisons among persons regardless of the particular items administered. The mathematical form that the model must take to achieve this also eliminates the estimation and design problems that have traditionally troubled test users (Andersen, 1972; Wright and Douglas, 1975).

The "cost" of these advantages is the exchange of the assumptions about the distribution of person ability that are necessary for obtaining parameter estimates with other latent trait models for the assumption of equal item discrimination implied by the Rasch model. Superficially this would seem a poor exchange; we are more apt to have a reasonable idea about the distribution than to have equal discriminations. However, the items, unlike the unknown distributions of person abilities, are under the control of the test constructor, and with adequate resources items can be developed which have relatively homogeneous discriminations.

There is also an important advantage in using the Rasch model as the basis of the analysis of fit that is distinct from its value as a measurement model. A more complex model will adapt itself more completely to specific sets of data and so obscure potentially interesting aspects of the situation which represent anomalies in the measurement process. Including discrimination as a parameter will seem to "explain" many occurrences of multidimensionality which would otherwise appear clearly as misfit with the Rasch model. Since, when discrimination is used as a parameter, there will appear to be no misfit in these cases, we will be less likely to investigate further and will miss an opportunity to discover the reason for the interaction between persons and items. This too-convenient mathematical "explanation" of lack of fit will also jeopardize the validity of future applications of the instrument.

The comparison with classical item analysis approaches is still more striking. The requirements necessary to use either the classical or the Rasch approach are essentially the same, except that the Rasch model takes the shape of the regression of the observed outcome on ability to be logistic rather than linear. The advantages are the invariant Rasch parameters and a rigorous mathematical model of the process which permits systematic evaluation of the data's fit to the model. The requirements governing the use of the Rasch model are simply the logical extensions of traditional requirements, but now facilitated by the explicitness of the model.

The traditional selection criteria are limited to attributing poor measurement to characteristics of items. Since it is difficult to address the issue of fit, item selection rules are limited to improving the total test reliability (Lord and Novick, 1968, Chap. 15). This is maximized if the "difficulty" (i.e., the proportion correct) for every item is 0.5 and the item discriminations (or biserial correlations) are as large as possible.

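For reference, these classical statistics are straightforward to compute. The sketch below is an editorial addition; the rest-score point-biserial is a standard variant of the item-total correlations the text mentions, and the NumPy names are assumptions.

```python
import numpy as np

def classical_item_stats(x):
    """Classical statistics for a persons-by-items matrix x of 0/1
    responses: each item's proportion correct (its classical
    "difficulty") and its point-biserial correlation with the
    rest-score (the total score excluding the item itself, which
    avoids the item correlating with its own contribution)."""
    p_correct = x.mean(axis=0)
    total = x.sum(axis=1, keepdims=True)
    rest = total - x
    r_pb = np.array([np.corrcoef(x[:, i], rest[:, i])[0, 1]
                     for i in range(x.shape[1])])
    return p_correct, r_pb
```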
These criteria have several problems. An observed proportion correct of 0.50 refers to the applicability of the item to the mean of the distribution of persons. The item will still be "inappropriate" for persons in the tails of this distribution, and in many applications these people are our major concern. Selecting items with high discriminations may also be misleading, since exceptionally high discriminations are frequently due to the local and passing influence of an extraneous factor. Including such items reduces the instrument's reliability and validity in future applications. Finally, if we are totally successful in obtaining items of 0.5 difficulty and discriminations near one, the instrument, whatever its number of items, will end up functioning as a single item (Tucker, 1946) as far as measurement is concerned, since each item will provide exactly the same information.

Fit analysis (and test design) based on the Rasch model incorporates and refines both of these rules. If we have some notion of a person's ability, we can select items which minimize the standard error of measurement (i.e., maximize precision and reliability) for that person individually. This allows us to tailor the instrument to best achieve the objective of any measurement application. If we are chiefly interested in a particular person, we can design the test for him. If we are interested in a population, we can design the test to best cover that range of ability; or, if we are interested in some external criterion, we can design the instrument to measure in that region. In no case are we dependent on the proportion of some calibrating sample that succeeded on the item.

The tests of fit for the Rasch model will reject items with unusually low or unusually high discriminations. This screens out items that are weak or potentially misleading measures of the variable, as defined by the remaining items. It protects us from including items that are vulnerable to the influence of extraneous factors which happen to be related to the variable of interest in our sample. It selects, from a set of items that we think are relevant to our variable, a homogeneous subset that can be reasonably accepted as measuring a single variable in the given sample in a generalizable way.
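The targeted test design described above can be sketched directly (an editorial illustration under assumed names, not the authors' procedure): in the Rasch model the information an item contributes at ability theta is p(1 - p), largest when the item's difficulty matches theta, and the standard error of measurement is the reciprocal square root of the summed information, so items can be chosen to concentrate precision where it is needed.

```python
import numpy as np

def item_information(theta, delta):
    """Rasch item information at ability theta: I = p * (1 - p),
    maximal when the item difficulty delta equals theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - delta)))
    return p * (1.0 - p)

def design_test(theta_target, difficulties, n_items):
    """Choose the n_items from a calibrated bank whose information at
    theta_target is greatest, and report the resulting standard error
    of measurement, SEM = 1 / sqrt(sum of information)."""
    info = item_information(theta_target, np.asarray(difficulties))
    chosen = np.argsort(info)[::-1][:n_items]
    sem = 1.0 / np.sqrt(info[chosen].sum())
    return chosen, sem
```

The same selection rule serves any of the targets mentioned in the text: theta_target can be set at one person's estimated ability, spread over a population's range, or fixed at an external criterion level.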