A Comparison of Traditional and IRT based Item Quality Criteria

Similar documents
Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

Turning Output of Item Response Theory Data Analysis into Graphs with R

The Effect of Option Homogeneity in Multiple-Choice Items

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Construct Validity of Mathematics Test Items Using the Rasch Model

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Section 5. Field Test Analyses

REPORT. Technical Report: Item Characteristics. Jessica Masters

Evaluating and restructuring a new faculty survey: Measuring perceptions related to research, service, and teaching

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College

Does factor indeterminacy matter in multi-dimensional item response theory?

A simple guide to IRT and Rasch 2

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

Daniel Boduszek University of Huddersfield

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

Students' perceived understanding and competency in probability concepts in an e-learning environment: An Australian experience

METHODS. Participants

Item Analysis: Classical and Beyond

Author's response to reviews

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

World Academy of Science, Engineering and Technology International Journal of Psychological and Behavioral Sciences Vol:8, No:1, 2014

Lecture 15 Chapters 12&13 Relationships between Two Categorical Variables

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis

Mark J. Anderson, Patrick J. Whitcomb Stat-Ease, Inc., Minneapolis, MN USA

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

1. Below is the output of a 2 (gender) x 3(music type) completely between subjects factorial ANOVA on stress ratings

Skala Stress. Putaran 1 Reliability. Case Processing Summary. N % Excluded a 0.0 Total

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches

Influences of IRT Item Attributes on Angoff Rater Judgments

Supplementary Material*

On the purpose of testing:

This is the publisher's copyrighted version of this article.

The Functional Outcome Questionnaire-Aphasia (FOQ-A) is a conceptually-driven

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Quick start guide for using subscale reports produced by Integrity

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures

Numeric Literacy in South Africa: Report. on the NIDS Wave 1 Numeracy Test. Technical Paper no. 5

Ascertaining the Credibility of Assessment Instruments through the Application of Item Response Theory: Perspective on the 2014 UTME Physics test

Item Analysis for Beginners

Development, Standardization and Application of

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale

Analysis of Confidence Rating Pilot Data: Executive Summary for the UKCAT Board

DISCPP (DISC Personality Profile) Psychometric Report

Evaluation of the Short-Form Health Survey (SF-36) Using the Rasch Model

Goal for Today. Making a Better Test: Using New Multiple Choice Question Varieties with Confidence

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Assessing the Validity and Reliability of the Teacher Keys Effectiveness. System (TKES) and the Leader Keys Effectiveness System (LKES)

Examining Construct Stability Across Career Stage Cohorts

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Properties of Single-Response and Double-Response Multiple-Choice Grammar Items

Examining the Psychometric Properties of The McQuaig Occupational Test

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Assessing the Validity and Reliability of Dichotomous Test Results Using Item Response Theory on a Group of First Year Engineering Students

Statistical Significance, Effect Size, and Practical Significance Eva Lawrence Guilford College October, 2017

AN ABSTRACT OF THE THESIS OF

The following is an example from the CCRSA:

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Approaches for the Development and Validation of Criterion-referenced Standards in the Korean Health Literacy Scale for Diabetes Mellitus (KHLS-DM)

Things you need to know about the Normal Distribution. How to use your statistical calculator to calculate The mean The SD of a set of data points.

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Item and response-category functioning of the Persian version of the KIDSCREEN-27: Rasch partial credit model

What Causes Stress in Malaysian Students and its Effect on Academic Performance: A case Revisited

Final Exam PS 217, Spring 2004

Description of components in tailored testing

Reviewing the TIMSS Advanced 2015 Achievement Item Statistics

Validation of an Analytic Rating Scale for Writing: A Rasch Modeling Approach

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology

A Rasch Model Analysis on Secondary Students Statistical Reasoning Ability in Descriptive Statistics

Unit outcomes. Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2018 Creative Commons Attribution 4.0.

Improving Measurement of Ambiguity Tolerance (AT) Among Teacher Candidates. Kent Rittschof Department of Curriculum, Foundations, & Reading

Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1)

ID# Exam 3 PS 306, Spring 2005 (You must use official ID#)

Interpersonal Citizenship Motivation: A Rating Scale Validity of Rasch Model Measurement

Exam 3 PS 306, Spring 2005

Kidane Tesfu Habtemariam, MASTAT, Principle of Stat Data Analysis Project work

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

Techniques for Explaining Item Response Theory to Stakeholder

Validation the Measures of Self-Directed Learning: Evidence from Confirmatory Factor Analysis and Multidimensional Item Response Analysis

On the Construct Validity of an Analytic Rating Scale for Speaking Assessment

IMPACT ON PARTICIPATION AND AUTONOMY QUESTIONNAIRE: INTERNAL SCALE VALIDITY OF THE SWEDISH VERSION FOR USE IN PEOPLE WITH SPINAL CORD INJURY

Essential Skills for Evidence-based Practice: Statistics for Therapy Questions

André Cyr and Alexander Davies

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

A Comparison of Traditional and IRT based Item Quality Criteria

Brian D. Bontempo, Ph.D., Mountain Measurement, Inc.
Jerry Gorham, Ph.D., Pearson VUE

April 7, 2006

A paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA

Executive Summary

This study compared the survival rates of new items screened with a traditional item statistic (the point-biserial correlation coefficient) to the survival rates of items screened with IRT item statistics (fit statistics). The assessment data came from two different large-scale, paper-and-pencil licensure and certification programs. Each exam's item pool contained at least 1,500 new items, each administered to at least 400 first-time test takers. Comparable traditional and IRT-based item selection criteria (e.g., point-biserial >= 0 or infit mean square < 2.0) were developed from the survival rates, and the percentage of items classified identically under each type of criteria was examined. Lastly, conditions, such as item difficulty, that encourage or discourage the use of one type of item selection criteria over the other were explored.

Introduction

Many large-scale assessment programs have switched from a classical test theory (CTT) paradigm to an IRT-based paradigm in recent years. However, most of these programs still use classical test theory statistics such as the point-biserial correlation coefficient to evaluate the statistical quality of newly developed items. This research provides IRT evaluation statistics, in the form of infit and outfit mean square fit statistics, that mimic the more traditional statistics. The goal of this study is to illustrate how item survival rates can be used to develop a conversion table that makes it easy for practitioners to convert their present CTT item evaluation criteria to IRT-based criteria. Since it is well known that the point-biserial correlation coefficient and other similar classical test theory item statistics are less than ideal, it is NOT the goal of this study to develop IRT evaluation criteria that produce the same outcomes as the classical test theory criteria. Rather, the goal is to develop criteria that yield a similar item survival rate. This goal is important because many large-scale assessment programs base their item selection criteria on real-world considerations, such as the number or percentage of items that must survive in order to construct the next version of the assessment.

Data

The data for this study came from two different large-scale, paper-and-pencil certification exams. The first exam (Exam A) contained 1,983 items assembled into 24 different overlapping test forms of 150 items each. The other exam (Exam B) contained 1,632 items assembled into 13 different overlapping test forms of 180 items each. Each form was administered to a sample of at least 400 representative first-time test takers. Both exams were in the field during 2005. The characteristics of the item sets are displayed in Tables 1 and 2. The exams were similar. Both had a large range of item difficulty: minimums were below -6 logits and maximums were above 4 logits. Exam A was slightly more challenging, and Exam B had a larger variance in item difficulty. One difference between the exams is that the Exam B item set was more diverse with respect to item quality: the ranges of the point-to-measure correlation coefficients, infit, and outfit were all greater than those of Exam A, as was the variance of outfit.

Table 1. Characteristics of Exam A Items

Exam A               Min     Max    Mean   Std. Dev.
Item Difficulty     -6.43    6.14   0.00   1.38
PtMe                -0.16    0.45   0.19   0.09
Infit Mean Square    0.89    1.16   0.99   0.04
Outfit Mean Square   0.52    1.58   0.97   0.10

Table 2. Characteristics of Exam B Items

Exam B               Min     Max    Mean   Std. Dev.
Item Difficulty     -6.46    4.81  -0.04   1.64
PtMe                -0.24    0.52   0.17   0.10
Infit Mean Square    0.87    1.19   1.00   0.04
Outfit Mean Square   0.25    2.45   0.99   0.12

Methodology

For each exam, the item-level results (1 = correct, 0 = incorrect) were concurrently calibrated with the dichotomous (1-PL) Rasch model using Winsteps. The point-to-measure correlation coefficient (computed with the item excluded from the person measure) was calculated for each item, as were the infit mean square and outfit mean square statistics. Each of these statistics was easily obtained using the options available in Winsteps.

The viability of each item from a traditional classical test theory perspective was determined by running each item through a point-to-measure correlation coefficient selection filter. Since licensure programs vary widely in the stringency of filtration, three levels of filtration were applied:

Low Item Performance Benchmark (Low Level of Filtration): point-to-measure correlation > 0.0
Medium Item Performance Benchmark (Medium Level of Filtration): point-to-measure correlation > 0.05
High Item Performance Benchmark (High Level of Filtration): point-to-measure correlation > 0.1

Under each of these levels of filtration, the survival rate of the items was calculated. Using these survival rates, a comparable set of IRT filtration criteria (based solely on the infit and outfit mean square values) was determined. Specifically, four IRT filters were applied:

a) heavier filtration on infit and lighter filtration on outfit
b) heavier filtration on outfit and lighter filtration on infit
c) a balanced approach to both infit and outfit
d) a compensatory approach in which the sum of the infit and outfit mean squares was used

After the comparable IRT filter values had been determined, the items that were filtered out were queried for further comparison. Tables 3 and 4 display the survival rates for each exam under the five item quality criteria. Inspection of these tables reveals that it was not possible to set the IRT criteria so that exactly the same survival rate was obtained. This was because the item quality statistics were rounded to the hundredths place, a sensible precision for these statistics, which created a discrete frequency distribution. In lieu of an exact match, the closest survival rate was used. One interesting finding was that the unbalanced approach favoring heavier filtration on outfit was unable to yield a survival rate that approximated the balanced approach; its survival rate was always lower (its failure rate was always higher). This technique was therefore abandoned.
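The infit and outfit statistics referenced above follow the standard Rasch definitions. For person n and item i, with x_ni the observed score and P_ni the Rasch-model probability of a correct response, the outfit mean square is the unweighted mean of the squared standardized residuals and the infit mean square is the information-weighted version:

$$z_{ni} = \frac{x_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}, \qquad \text{Outfit MNSQ}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad \text{Infit MNSQ}_i = \frac{\sum_{n=1}^{N}\left(x_{ni} - P_{ni}\right)^{2}}{\sum_{n=1}^{N} P_{ni}\left(1 - P_{ni}\right)}$$

The survival-rate matching itself is easy to script. The sketch below is a minimal illustration rather than the authors' actual procedure: it assumes a scored response matrix plus Rasch person measures and item difficulties are already available (e.g., exported from a Winsteps calibration), computes the item statistics from the formulas above, and then searches for the fit-statistic cutoff whose survival rate most closely matches that of a given point-to-measure filter. All function and variable names are illustrative.

import numpy as np

def rasch_prob(theta, b):
    # P(correct) for each person-item pair under the dichotomous Rasch model
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_statistics(X, theta, b):
    # X: persons x items matrix of 0/1 scores; theta: person measures; b: item difficulties
    P = rasch_prob(theta, b)
    W = P * (1.0 - P)                                   # model variance of each response
    outfit = ((X - P) ** 2 / W).mean(axis=0)            # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # information-weighted mean square
    # Pearson correlation of item scores with person measures
    # (the exclusion of the item from the measure is omitted here for simplicity)
    ptme = np.array([np.corrcoef(X[:, i], theta)[0, 1] for i in range(X.shape[1])])
    return ptme, infit, outfit

def matched_cutoff(fit_values, target_survival):
    # Upper cutoff on a fit statistic whose survival rate (share of items below
    # the cutoff) is closest to the survival rate produced by the CTT filter.
    candidates = np.unique(np.round(fit_values, 2))     # statistics rounded to hundredths
    rates = [(c, np.mean(fit_values < c)) for c in candidates]
    return min(rates, key=lambda cr: abs(cr[1] - target_survival))[0]

# Example usage, translating the medium CTT filter (point-to-measure > 0.05)
# into fit-statistic cutoffs with a similar survival rate:
# ptme, infit, outfit = item_statistics(X, theta, b)
# survival_ctt = np.mean(ptme > 0.05)
# sumfit_cut   = matched_cutoff(infit + outfit, survival_ctt)             # compensatory rule
# balanced_cut = matched_cutoff(np.maximum(infit, outfit), survival_ctt)  # balanced rule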

Table 3. Item Survival Rates for Exam A

PtMe                         Criterion         Percent Flagged   Number Flagged
Low Filtration               >= 0              1.7%              33
Med Filtration               >= .05            4.2%              84
High Filtration              >= .1             12.3%             243

Sumfit                       Criterion                           Number Flagged
Low Filtration               <= 2.24                             34
Med Filtration               <= 2.17                             80
High Filtration              <= 2.11                             250

Balanced                     Infit     Outfit                    Number Flagged
Low Filtration               < 1.18    < 1.18                    32
Med Filtration               < 1.12    < 1.12                    87
High Filtration              < 1.07    < 1.07                    258

Heavy Infit, Light Outfit    Infit     Outfit                    Number Flagged
Low Filtration               < 1.10    < 1.23                    36
Med Filtration               < 1.08    < 1.15                    86
High Filtration              < 1.05    < 1.10                    245

Light Infit, Heavy Outfit    Infit     Outfit                    Number Flagged
Low Filtration               < 1.19    < 1.17                    36
Med Filtration               < 1.13    < 1.11                    105
High Filtration              < 1.08    < 1.06                    331

Table 4. Item Survival Rates for Exam B

PtMe                         Criterion         Percent Flagged   Number Flagged
Low Filtration               >= 0              3.4%              55
Med Filtration               >= .05            10.7%             175
High Filtration              >= .1             23.9%             390

Sumfit                       Criterion                           Number Flagged
Low Filtration               <= 2.24                             58
Med Filtration               <= 2.14                             185
High Filtration              <= 2.07                             395

Balanced                     Infit     Outfit                    Number Flagged
Low Filtration               < 1.18    < 1.18                    56
Med Filtration               < 1.10    < 1.10                    187
High Filtration              < 1.05    < 1.05                    416

Heavy Infit, Light Outfit    Infit     Outfit                    Number Flagged
Low Filtration               < 1.10    < 1.23                    52
Med Filtration               < 1.07    < 1.15                    184
High Filtration              < 1.03    < 1.10                    387

Light Infit, Heavy Outfit    Infit     Outfit                    Number Flagged
Low Filtration               < 1.19    < 1.17                    70
Med Filtration               < 1.11    < 1.09                    224
High Filtration              < 1.06    < 1.04                    464

Analysis

For each set of filtered items, the percentage of items flagged by one method but not the other was identified, as was the percentage of flagged items that were flagged by both methods. These statistics are displayed for Exam A in Tables 5-10 and for Exam B in Tables 11-16.

For Exam A, the most similar methods were the sum of infit and outfit mean square (sumfit) and the heavy-infit/light-outfit filter. Over 90% of the items flagged by one of these methods were also flagged by the other, except when the filtration level was low, in which case only 75% of the items flagged by one method were also flagged by the other. For Exam A, the most dissimilar methods were the balanced infit/outfit filter and the heavy-infit/light-outfit filter; at the lowest level of filtration, only 56% of the items flagged by one method were also flagged by the other.

For Exam B, the most similar methods were the sum of infit and outfit (sumfit) and the balanced infit/outfit filter. At the highest level of filtration, 94% of the items flagged by one method were also flagged by the other. As with Exam A, over 90% of the items flagged by sumfit were also flagged by the balanced infit/outfit rule, except when the filtration level was low, in which case 83% of the items flagged by one method were also flagged by the other. In all pairwise method comparisons, at least 60% of the items flagged by one method were also flagged by the other.
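The pairwise agreement summarized in Tables 5-16 amounts to cross-tabulating two sets of flags. A minimal sketch of one plausible way to compute it (assuming boolean flag arrays like those produced in the earlier sketch; the names and the exact choice of denominator are illustrative, since the paper does not spell them out):

import numpy as np

def flag_agreement(flag_a, flag_b):
    # flag_a, flag_b: boolean arrays, True where an item is flagged (filtered out)
    either = flag_a | flag_b
    both = flag_a & flag_b
    pct_both = both.sum() / either.sum()   # flagged by both, among items flagged by either
    pct_one_only = 1.0 - pct_both          # flagged by exactly one method
    return pct_both, pct_one_only

# Example: compare the medium CTT filter to the matched sumfit filter.
# flag_ptme   = ptme < 0.05
# flag_sumfit = (infit + outfit) > sumfit_cut
# pct_both, pct_one_only = flag_agreement(flag_ptme, flag_sumfit)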

Table 5. Exam A Survival Rate Similarity: Point-to-Measure and Sumfit; Infit or Outfit Balanced
Flagged 66% 36%  Surviving 34%
Flagged 72% 25%  Surviving 28%
Flagged 75% 20%  Surviving 25%

Table 6. Exam A Survival Rate Similarity between Point-to-Measure and Sumfit
Sum of Infit and Outfit
Flagged 82% 15%  Surviving 18%

Table 7. Exam A Survival Rate Similarity between Point-to-Measure and Heavy Infit or Light Outfit
Flagged 69% 24%  Surviving 31%
Flagged 94% 4%   Surviving 6%

Table 8. Exam A Survival Rate Similarity between Sumfit and Infit or Outfit Balanced
Sum of Infit and Outfit Mean Square; Infit or Outfit Balanced
Flagged 81% 24%  Surviving 19%
Flagged 80% 20%  Surviving 20%
Flagged 82% 1%   Surviving 18%

Table 9. Exam A Survival Rate Similarity between Sumfit and Heavy Infit or Light Outfit
Flagged 75% 21%  Surviving 25%
Flagged 82% 15%  Surviving 18%
Sum of Infit and Outfit Mean Square
Flagged 91% 10%  Surviving 9%
Flagged 83% 28%  Surviving 17%

Table 10. Exam A Survival Rate Similarity between Infit or Outfit Balanced and Heavy Infit or Light Outfit
Flagged 56% 38%  Surviving 44%
Flagged 84% 14%  Surviving 16%
Infit or Outfit Balanced
Flagged 73% 28%  Surviving 27%
Flagged 70% 30%  Surviving 30%
Flagged 81% 23%  Surviving 19%

Table 11. Exam B Survival Rate Similarity: Point-to-Measure and Sumfit; Infit or Outfit Balanced
Flagged 66% 33%  Surviving 34%
Flagged 71% 25%  Surviving 29%
Flagged 67% 28%  Surviving 33%

Table 12. Exam B Survival Rate Similarity between Point-to-Measure and Sumfit
Sum of Infit and Outfit
Flagged 82% 16%  Surviving 18%
Flagged 68% 28%  Surviving 32%
Flagged 68% 31%  Surviving 32%

Table 13. Exam B Survival Rate Similarity between Point-to-Measure and Heavy Infit or Light Outfit
Flagged 71% 33%  Surviving 29%

Table 14. Exam B Survival Rate Similarity between Sumfit and Infit or Outfit Balanced
Sum of Infit and Outfit Mean Square; Infit or Outfit Balanced
Flagged 73% 27%  Surviving 27%
Flagged 84% 15%  Surviving 16%
Flagged 93% 2%   Surviving 7%

Table 15. Exam B Survival Rate Similarity between Sumfit and Heavy Infit or Light Outfit
Sum of Infit and Outfit Mean Square
Flagged 83% 23%  Surviving 17%
Flagged 91% 10%  Surviving 9%
Flagged 92% 10%  Surviving 8%

Table 16. Exam B Survival Rate Similarity between Infit or Outfit Balanced and Heavy Infit or Light Outfit
Flagged 63% 41%  Surviving 37%
Flagged 68% 29%  Surviving 32%
Infit Balanced
Flagged 76% 25%  Surviving 24%
Flagged 64% 36%  Surviving 36%
Flagged 90% 16%  Surviving 10%

Each item was plotted on a graph (see Figures 1 and 2) to illustrate the relationship between item difficulty (in logits), the point-to-measure correlation coefficient, and the behavior of the low-filtration flagging criteria. The main finding from these graphs is that the sumfit rule was better than the infit-or-outfit rule at detecting above-average-difficulty items with negative point-to-measure correlation coefficients. The same pattern was evident for the medium and high levels of filtration.

Figure 1. Exam A point-to-measure correlation plotted against item difficulty (logits) by flagging rule (legend: Flagged by Both; Flagged by Infit or Outfit; Flagged by Sumfit; Not Flagged).
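Figures of this kind (Figures 1 and 2) are simple to reproduce. The sketch below is an illustrative recipe only, not the authors' plotting code: matplotlib is assumed, and the flag arrays are the hypothetical booleans from the earlier sketches indicating which low-filtration rule flagged each item.

import matplotlib.pyplot as plt

def plot_flags(b, ptme, flag_sumfit, flag_balanced):
    # Scatter of point-to-measure correlation against item difficulty,
    # colored by which low-filtration rule flagged each item.
    groups = {
        "Flagged by Both": flag_sumfit & flag_balanced,
        "Flagged by Infit or Outfit": flag_balanced & ~flag_sumfit,
        "Flagged by Sumfit": flag_sumfit & ~flag_balanced,
        "Not Flagged": ~(flag_sumfit | flag_balanced),
    }
    for label, mask in groups.items():
        plt.scatter(b[mask], ptme[mask], s=12, label=label)
    plt.xlabel("Item Difficulty (logits)")
    plt.ylabel("Point-to-Measure Correlation")
    plt.legend()
    plt.show()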

Figure 2. Exam B point-to-measure correlation plotted against item difficulty (logits) by flagging rule (legend: Flagged by Both; Flagged by Infit or Outfit; Flagged by Sumfit; Not Flagged).

Conclusion

This was a short and simple study that used item survival rates as the benchmark for carrying traditional classical test theory item quality criteria over to the item response theory world. Item survival rates are commonly used and discussed by item developers and psychometricians, and in the real world of large-scale licensure and certification assessment they are fairly stable from one item development effort to the next. For programs interested in transitioning from classical test theory item quality criteria to item response theory criteria, the most important question asked by program stakeholders is "How will this transition affect the item survival rate?" By starting with this question, we were able to adjust the flagging criteria values until we were certain that item survival rates would not change.

Of the four IRT methods tested, the sum of infit and outfit mean square yielded the outcomes most similar to the point-to-measure correlation coefficient. Programs interested in mimicking classical test theory may wish to use this method. Programs adopting it will still benefit from calculating infit and outfit individually, since these statistics can be used to troubleshoot and revise weak items to bolster their viability.

Since each method has statistical merits and weaknesses, future research should probe the perceived quality of each method by investigating the questionable items. That is, subject matter experts should review the set of items flagged by one method and not by the other and determine which pile of questionable items is more desirable.

In conclusion, licensure and certification programs that still rely on classical test theory statistics such as the point-biserial correlation coefficient should use item survival rates, and the method outlined in this study, when modifying their item quality selection criteria. By doing so, programs will be able to choose a method of filtration that best suits their program while also ensuring item pool stability.