A Broad-Range Tailored Test of Verbal Ability


Frederic M. Lord
Educational Testing Service

Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from fifth grade through graduate school. Simulated test administrations indicate that the 25-item tailored test is at least as good as a comparable 50-item conventional test. At most ability levels, the tailored test measures much better. An offer is made to provide upon request item characteristic curve parameters for 690 widely used Cooperative Test items, in order to facilitate research.

This report describes briefly a broad-range tailored test of verbal ability, appropriate at any level from fifth grade upwards, through graduate school. The test score places persons at all levels directly on the same score scale.

In a tailored test, the items administered to an individual are chosen for their effectiveness for measuring him. Items administered later in the test are selected by computer, according to some rule based on the individual's performance on the items administered to him earlier. Improved measurement is obtained by 1) matching item difficulty to the ability level of the individual and 2) using the more discriminating items in the available item pool. The matching of test difficulty to the individual's ability level is advantageous and desirable for psychological reasons.

For references on tailored testing, see Wood (1973); also Cliff (1975), Jensema (1974a, 1974b), Killcross (1974), Mussio (1973), Spineti and Hambleton (1975), Urry (1974a, 1974b), Waters (1974), Betz and Weiss (1974), DeWitt and Weiss (1974), Larkin and Weiss (1974), McBride and Weiss (1974), Weiss (1973, 1974), Weiss and Betz (1973).

Test Structure

The broad-range test consists of 182 verbal items. These were chosen from all levels of Cooperative Tests SCAT and STEP, from the College Entrance Examination Board's Preliminary Scholastic Aptitude Test, and from the Graduate Record Examination. The choice was made solely on the basis of item type and difficulty level. There was no attempt to secure the best items by selecting on item discriminating power.

Two parallel forms of this 182-item tailored test were constructed. Only one of these forms is considered here. The 182 items in a single form of the test are represented in Table 1, where they are arranged in columns by difficulty level. An individual answers just one item in each row of the table, a total of just 25 items. There are five verbal item types, denoted by a, b, c, d, e. Within each item type, the items in each column are arranged in order of discriminating power, with the best items at the top.
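To make this arrangement concrete, the following sketch represents such a pool as a grid of cells, one per row and difficulty column, with each cell holding the items of a single type sorted so that the most discriminating item comes first. It is only an illustration: the class names, the column spacing, and all numerical values are hypothetical, and Table 1 itself is not reproduced in this transcription.

    # Hypothetical sketch of a Table 1-style item pool; not Lord's actual data.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Item:
        a: float          # discriminating power
        b: float          # difficulty
        c: float          # lower asymptote (chance of answering correctly by guessing)
        item_type: str    # one of the five verbal item types "a".."e"

    @dataclass
    class Cell:
        items: List[Item] = field(default_factory=list)   # kept sorted by a, descending

    N_ROWS, N_COLS = 25, 10                                # 25 rows, 10 difficulty columns
    COLUMN_DIFFICULTY = [-2.5 + 0.5 * j for j in range(N_COLS)]   # illustrative spacing

    # table_1[row][col] is the cell an examinee draws from when routed to that
    # row at that difficulty level; ideally every cell in a row holds one item type.
    table_1 = [[Cell() for _ in range(N_COLS)] for _ in range(N_ROWS)]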

Ideally there should be only one item type in each row, so that all examinees would take the same number of items of each type. The arrangement of Table 1 is an attempt to approximate this ideal using the items available. (Few if any hard items of types a and e were in the total pool; there were also few if any easy items of types b and c. Types a and b, also types c and e, seem fairly similar.)

The examinee starts with an item in the first row. The difficulty level of this item is determined by the examinee's grade level, or some other rough estimate of his ability. If he answers the first item correctly, he next takes an item in the second row that is harder than (to the right of) the first item. If he answers the first item incorrectly, he next takes an item in the second row that is easier than (to the left of) the first item. He may continue with the third and subsequent rows, moving to the right after each correct answer, or to the left after each incorrect answer, until he has at least one right answer and at least one wrong answer.

At this point, the computer uses item characteristic curve theory to compute the maximum likelihood estimate of the examinee's ability level. In effect, the computer asks: For what ability level is the likelihood of the observed pattern of responses at a maximum, taking into account the difficulty and other characteristics of the items administered up to this point? The ability level that maximizes this likelihood is the current estimate of the examinee's ability.

From this point on, the next item to be administered will be of the same item type as the item in the next row that best matches in difficulty the examinee's estimated ability level. Given this item type, we survey all items of this type and administer next the item that gives the most information at his estimated ability level. After each new response by the examinee, his ability is re-estimated. The item type of the next item is determined, as above, and the best item (not already used) of that type is chosen and administered. This continues until he has answered 25 items, one for each row of the table.

The maximum likelihood estimate of his ability determined from his responses to all 25 items is his final verbal ability score. According to the item characteristic curve model, all such scores for various examinees are automatically on the same ability scale, regardless of which set of items was administered.
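The two phases of this procedure, up-and-down branching until the response pattern contains both a right and a wrong answer, and then maximum likelihood scoring with maximum-information item selection, can be sketched roughly as follows. This is only an illustration under an assumed three-parameter logistic item characteristic curve (the report does not state the model's functional form); the row-by-row item-type bookkeeping of Table 1 is omitted, and every function name and numerical value below is hypothetical.

    # Rough sketch of the tailored-testing logic described above.
    import math
    import random

    def icc(theta, a, b, c):
        """Probability of a correct answer under an assumed 3PL model."""
        return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

    def item_information(theta, a, b, c):
        """Fisher information contributed by one item at ability theta."""
        p = icc(theta, a, b, c)
        return (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

    def mle_theta(responses):
        """Grid-search maximum likelihood estimate of ability from ((a, b, c), score) pairs."""
        grid = [g / 50.0 for g in range(-200, 201)]            # -4.0 .. 4.0
        def loglik(theta):
            return sum(math.log(icc(theta, a, b, c)) if u
                       else math.log(1.0 - icc(theta, a, b, c))
                       for (a, b, c), u in responses)
        return max(grid, key=loglik)

    def administer(pool, true_theta, start_difficulty, n_items=25):
        """Simulate one tailored administration; returns the final ability estimate."""
        remaining = list(pool)
        responses, target_difficulty = [], start_difficulty
        for _ in range(n_items):
            scores = [u for _, u in responses]
            if 1 in scores and 0 in scores:
                # Phase 2: administer the unused item with maximum information
                # at the current maximum likelihood estimate of ability.
                theta_hat = mle_theta(responses)
                item = max(remaining, key=lambda it: item_information(theta_hat, *it))
            else:
                # Phase 1: branch to a roughly harder (or easier) unused item
                # after a correct (or incorrect) answer.
                item = min(remaining, key=lambda it: abs(it[1] - target_difficulty))
            remaining.remove(item)
            u = int(random.random() < icc(true_theta, *item))  # simulated response
            target_difficulty = item[1] + (0.5 if u else -0.5)
            responses.append((item, u))
        return mle_theta(responses)

    # Example with a small random pool of 182 hypothetical (a, b, c) items.
    pool = [(random.uniform(0.5, 2.0), random.uniform(-2.5, 2.5), 0.2)
            for _ in range(182)]
    print(administer(pool, true_theta=0.8, start_difficulty=0.0))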
Results

About thirty different designs for a broad-range tailored test of verbal ability were tried out on the computer, administering each one to a thousand or so simulated examinees. The final design was recently chosen and has not yet been implemented on the computer for administration to real flesh-and-blood examinees.

Entry point effects. Consider first the effect of the difficulty level of the first item administered. The vertical dimension in Figure 1 represents the standard error of measurement of obtained test scores on the broad-range tailored test, computed by a Monte Carlo study. Each symbol shows how the standard error of measurement varies with ability level (horizontal axis). The four symbols represent the results obtained with four different starting points. The points marked + were obtained when the difficulty level of the first item administered was near -1.0 on the horizontal scale, about fifth-grade level. The small dots represent the results when the difficulty level of the first item was near 0, about ninth-grade level. For the hexagons, it was near .75, near the average verbal ability level of college applicants taking the College Entrance Examination Board's Scholastic Aptitude Test. For the points marked by an x, it was near 1.5. For any given ability level, the standard error of measurement varies surprisingly little, considering the extreme variation in starting item difficulty.

Figure 1. The standard error of measurement at 13 different ability levels for four different starting points for the 25-item broad-range tailored test.
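Continuing the sketch above (and therefore relying on its hypothetical administer function and random pool), the kind of Monte Carlo computation behind such standard-error curves amounts to repeating simulated administrations at a fixed true ability and taking the standard deviation of the resulting estimates. This is only an illustration of the method, not the study that produced Figure 1.

    # Continuation of the previous sketch: standard error of measurement at one
    # true ability level, for one choice of starting difficulty.
    import statistics

    def standard_error(pool, true_theta, start_difficulty, n_reps=200):
        estimates = [administer(pool, true_theta, start_difficulty)
                     for _ in range(n_reps)]
        return statistics.pstdev(estimates)

    # One point of a Figure 1-style curve: the first item pitched near
    # difficulty -1.0 (roughly fifth-grade level on the report's scale).
    print(standard_error(pool, true_theta=1.0, start_difficulty=-1.0))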

Item pool structure. Various designs were also tried out with more columns or with fewer than the 10 columns shown in Table 1. A test with 20 columns, spanning roughly the same difficulty range as Table 1 but requiring 363 items, was found to be at least twice as good as the 10-column, 182-item test of Table 1. The reason for this is not that the columns in Table 1 are too far apart, but mainly that selecting the best items (best for a particular individual) from a 363-item pool will give a much better 25-item test than selecting the same number of items from a smaller, 182-item pool. Still better tests could be produced by using still larger item pools, even though only 25 items are administered to each examinee.

Comparison with conventional test. It is important to compare the broad-range tailored test with a conventional test, such as the Preliminary Scholastic Aptitude Test of the College Entrance Examination Board. Figure 2 shows the information function for the Verbal score on each of three forms of the PSAT adjusted to a test length of just 25 items. Figure 2 also shows the information function for the Verbal score on the broad-range tailored test, which administers just 25 items to each examinee. (When the test score is an unbiased estimator of ability, the information function is simply the reciprocal of the squared standard error of measurement. A k-fold increase in information may be interpreted as the kind of increase that would be obtained by lengthening a conventional test k-fold.) The tailored test shown in Figure 2 corresponds to the hexagons of Figure 1, since they represent the results obtained when the first item administered is at a difficulty level appropriate for average college applicants. The PSAT information functions are computed from estimated item parameters. For points spaced along the ability scale, the tailored test information function is estimated from the test responses of simulated examinees.

Figure 2. Information function for the 25-item tailored test, also for three forms of the Preliminary Scholastic Aptitude Test (dotted lines) adjusted to a test length of 25 items.

It is encouraging, but not surprising, to find that the tailored test is at least twice as good as a 25-item conventional PSAT at almost all ability levels. After all, at the same time that we are tailoring the test to fit the individual, we are taking advantage of the large item pool, using the best 25 items available within certain restrictions already mentioned concerning item type. It would, of course, be desirable to confirm this evaluation by extensive test administrations, using flesh-and-blood examinees instead of simulated examinees.
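Stated in symbols, the parenthetical note above amounts to the following (with $\sigma_e(\theta)$ denoting the standard error of measurement at ability $\theta$, and $A$ and $B$ standing for any two tests scored as unbiased estimators of $\theta$):

\[
I(\theta) \;=\; \frac{1}{\sigma_e^{2}(\theta)},
\qquad
\frac{I_A(\theta)}{I_B(\theta)} \;=\; k ,
\]

so a test supplying $k$ times the information at $\theta$ measures there as well as the other test would if it were lengthened $k$-fold with comparable items.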

Item Pool Availability

In conclusion, the writer would like to make an offer that should enable research workers and graduate students to conveniently design and build actual tailored tests and administer them to real examinees. On written request from suitably qualified individuals, he will provide estimated item parameters for the verbal items in any or all of the following Cooperative Tests: SCAT II, Forms 1A, 2A, 2B, 3A, 3B, 4A (50 items each); STEP II, Reading Test, Part I only, Forms 2A, 2B, 3A, 3B, 4A (30 items each); SCAT I, Forms 2A, 2B, 3A, 3B (60 items each). This represents a pool of 690 calibrated verbal items available for research or other purposes. (This offer expires when better methods for estimating item parameters have been developed; very soon, it is hoped.)

References

Betz, N. E., & Weiss, D. J. Simulation studies of two-stage ability testing. Research Report 74-4. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974. (AD A001230)

Cliff, N. Complete orders from incomplete data: interactive ordering and tailored testing. Psychological Bulletin, 1975, 82, 289-302.

De Witt, L. J., & Weiss, D. J. A computer software system for adaptive ability measurement. Research Report 74-1. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974. (AD 773961)

Jensema, C. J. An application of latent trait mental test theory. British Journal of Mathematical and Statistical Psychology, 1974, 27, 29-48. (a)

Jensema, C. J. The validity of Bayesian tailored testing. Educational and Psychological Measurement, 1974, 34, 757-766. (b)

Killcross, M. C. A tailored testing system for selection and allocation in the British Army. A paper presented at the 18th International Congress of Applied Psychology, Montreal, August 1974.

Larkin, K. C., & Weiss, D. J. An empirical investigation of computer-administered pyramidal ability testing. Research Report 74-3. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974. (AD 783553)

McBride, J. R., & Weiss, D. J. A word knowledge item pool for adaptive ability measurement. Research Report 74-2. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974. (AD 781894)

Mussio, J. J. A modification to Lord's model for tailored tests. Unpublished doctoral dissertation, University of Toronto, 1973.

Spineti, J. P., & Hambleton, R. K. A computer simulation study of tailored testing strategies for objective-based instructional programs. Unpublished manuscript, University of Massachusetts, 1975.

Urry, V. W. Approximations to item parameters of mental test models and their uses. Educational and Psychological Measurement, 1974, 34, 253-269. (a)

Urry, V. W. Computer assisted testing: the calibration and evaluation of the verbal ability bank. Technical Study 74-3. Washington, D.C.: Personnel Research and Development Center, U.S. Civil Service Commission, 1974. (b)

Waters, B. K. An empirical investigation of the stradaptive testing model for the measurement of human ability. Unpublished doctoral dissertation, The Florida State University, 1974.

Weiss, D. J. The stratified adaptive computerized ability test. Research Report 73-3. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1973. (AD 768376)

Weiss, D. J. Strategies of adaptive ability measurement. Research Report 74-5. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1974. (AD A004270)

Weiss, D. J., & Betz, N. E. Ability measurement: conventional or adaptive? Research Report 73-1. Minneapolis, Minn.: Psychometric Methods Program, Department of Psychology, University of Minnesota, 1973. (AD 757788)

Wood, R. Response-contingent testing. Review of Educational Research, 1973, 43, 529-544.

Author's Address

Frederic M. Lord, Educational Testing Service, Princeton, N.J. 08540

Harry H. Harman Memorial Fund

In memory of Harry H. Harman, who died on June 8, 1976, his friends and colleagues at Educational Testing Service have established the Harry H. Harman Loan Fund for Students in Quantitative Psychology and Educational Measurement at the University of Chicago. Contributions to this fund may be made by sending a check, made out to "the University of Chicago," to R. Darrell Bock, 5835 S. Kimbark Avenue, Chicago, Illinois 60637, and indicating that it is for the Harry H. Harman Memorial Loan Fund. Such gifts are tax deductible.