An Investigation of Vertical Scaling with Item Response Theory Using a Multistage Testing Framework

University of Iowa, Iowa Research Online: Theses and Dissertations, 2008

An Investigation of Vertical Scaling with Item Response Theory Using a Multistage Testing Framework

Jonathan James Beard, University of Iowa

Copyright 2008 Jonathan James Beard. This dissertation is available at Iowa Research Online.

Recommended Citation: Beard, Jonathan James. "An Investigation of vertical scaling with item response theory using a multistage testing framework." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008. Part of the Educational Psychology Commons.

AN INVESTIGATION OF VERTICAL SCALING WITH ITEM RESPONSE THEORY USING A MULTISTAGE TESTING FRAMEWORK

by Jonathan James Beard

An Abstract of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa, December 2008.

Thesis Supervisor: Associate Professor Timothy Ansley

ABSTRACT

A simulation study was carried out to assess the effects of using different testing frameworks and different statistical estimators in constructing a vertical scale. The adaptive multistage testing framework (MST) consisted of five test forms administered across three testing occasions. The single form testing framework (SFT) consisted of one form at each of the three testing occasions. Maximum likelihood estimation (MLE) and Bayesian expected a posteriori (EAP) estimators were used to estimate each simulee's ability at the three testing occasions. Item response theory (IRT) true scores, or domain scores, were used as the score scale. This was done to facilitate the use of growth scores between testing occasions. It was hypothesized that testing framework and estimation procedure would influence the recovery of the known domain score for each simulee across the three testing occasions, as well as the growth values between testing occasions. Average absolute deviation (AAD) values indicated that the MST framework offered a slight reduction in error when compared to the SFT framework in estimating IRT domain scores. The pattern of errors in estimation indicated that the MST framework provided more accurate estimates across the range of ability. The MST framework also offered a slight reduction in error when estimating IRT growth scores. Horizontal distances between test administrations indicated that EAP estimation produced uneven departures from known horizontal distances, but MLE did not. This was true for both the SFT and MST frameworks. Also, when the distributions of IRT domain scores were considered, the MLE estimation method was more consistent with the distribution of known domain scores. Overall, the MST framework performed better than the SFT framework with respect to reduced estimation error and approximating the known IRT domain score.

Abstract Approved: Thesis Supervisor; Title and Department; Date

AN INVESTIGATION OF VERTICAL SCALING WITH ITEM RESPONSE THEORY USING A MULTISTAGE TESTING FRAMEWORK

by Jonathan James Beard

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa, December 2008.

Thesis Supervisor: Associate Professor Timothy Ansley

Copyright by JONATHAN JAMES BEARD, 2008. All Rights Reserved.

Graduate College, The University of Iowa, Iowa City, Iowa

CERTIFICATE OF APPROVAL: PH.D. THESIS

This is to certify that the Ph.D. thesis of Jonathan James Beard has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the December 2008 graduation.

Thesis Committee: Timothy Ansley (Thesis Supervisor), Stephen Dunbar, Michael Kolen, Lelia Helms, Won-Chan Lee

To James, Kate, Floyd, and Virginia

ACKNOWLEDGEMENTS

I have always had the great fortune to be surrounded by those who are more gifted and talented than myself. The time I have spent completing this study is no exception. There are many who deserve my heartfelt thanks. I first wish to thank my advisor and thesis supervisor, Dr. Tim Ansley, for his patient and wise words. Without him, this project could not have been completed. I would also like to thank the other members of my committee: Dr. Stephen Dunbar, Dr. Michael Kolen, Dr. Lelia Helms, and Dr. Won-Chan Lee. Their guidance served as a model for future academic thinking and writing. I want to thank my parents for providing me with the drive to persevere. I also wish to thank the friends and companions who have helped me along the way. I would especially like to acknowledge the help of the following people: Tom Proctor, Kyong-Hee Chon, Tawnya Knupp, Scott Wood, Michelle Mengeling, Paul Westrick, and Liz Hollingworth. A very special word of thanks is extended to Melissa Chapman and David Haynes. Above all, I would like to thank my wife, Michelle. She gives meaning to everything I do.

ABSTRACT

A simulation study was carried out to assess the effects of using different testing frameworks and different statistical estimators in constructing a vertical scale. The adaptive multistage testing framework (MST) consisted of five test forms administered across three testing occasions. The single form testing framework (SFT) consisted of one form at each of the three testing occasions. Maximum likelihood estimation (MLE) and Bayesian expected a posteriori (EAP) estimators were used to estimate each simulee's ability at the three testing occasions. Item response theory (IRT) true scores, or domain scores, were used as the score scale. This was done to facilitate the use of growth scores between testing occasions. It was hypothesized that testing framework and estimation procedure would influence the recovery of the known domain score for each simulee across the three testing occasions, as well as the growth values between testing occasions. Average absolute deviation (AAD) values indicated that the MST framework offered a slight reduction in error when compared to the SFT framework in estimating IRT domain scores. The pattern of errors in estimation indicated that the MST framework provided more accurate estimates across the range of ability. The MST framework also offered a slight reduction in error when estimating IRT growth scores. Horizontal distances between test administrations indicated that EAP estimation produced uneven departures from known horizontal distances, but MLE did not. This was true for both the SFT and MST frameworks. Also, when the distributions of IRT domain scores were considered, the MLE estimation method was more consistent with the distribution of known domain scores. Overall, the MST framework performed better than the SFT framework with respect to reduced estimation error and approximating the known IRT domain score.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
  Adaptive Testing
  Vertical Scaling
  Vertical Scaling with Adaptive Tests
  Purpose of the Study
  Research Questions

CHAPTER 2 LITERATURE REVIEW
  Adaptive Testing
  Complete Test Form Adaptation
  Multistage Testing
  Computerized Adaptive Testing
  Adaptive Multistage Testing
  Summary
  Vertical Scaling
  Data Collection
  Growth
  Item Response Theory Scaling
  Parameterization
  Response Type
  Dimensionality
  Summary
  Assumptions
  Parameter Estimation
  Effects of Different Estimators
  Defining IRT Scale Relationships
  Linear Transformations
  Characteristic Curve Transformations
  Concurrent Calibration
  Vertical Scaling Using IRT
  Summary
  NELS Testing Procedures
  IRT True Score Scaling

CHAPTER 3 METHODOLOGY
  Simulation Procedures
  Simulation of Known Abilities
  Generation of Item Responses
  Selection of Appropriate Items
  Ability Parameter Estimation
  MLE Estimation
  EAP Estimation
  Multiple Group Estimation
  BILOG-MG Procedures
  Summary
  Analysis
  Evaluation of IRT Domain Scores
  Evaluation of Growth Distance Measures
  Summary
  Research Hypotheses

CHAPTER 4 RESULTS
  Item Parameters and Simulated Values
  Test Characteristics and Item Parameters
  Simulated Ability Values
  Simulated Domain Score Values
  Estimated Values
  Estimated Ability Values
  Estimated Domain Score Values
  Estimated Growth Score Values
  Recovery of Scores
  Recovery of IRT Domain Scores
  Recovery of IRT Growth Scores
  Differences in Growth
  EAP Score Investigation
  Results Summary

CHAPTER 5 DISCUSSION
  Summary and Discussion
  IRT Domain Scores
  IRT Growth Scores
  Separation of Distributions
  Summary of Findings
  Implications of the Study
  Limitations of the Study
  Future Research
  Conclusions

APPENDIX A DESCRIPTION OF KNOWN ITEM PARAMETERS
APPENDIX B ABILITY AND TRUE SCORE DISTRIBUTIONS
APPENDIX C IRT TRUE GROWTH SCORE DISTRIBUTIONS
APPENDIX D IRT TRUE SCORE COMPARISONS
APPENDIX E DIFFERENCES IN HORIZONTAL DISTANCES ACROSS TESTING FRAMEWORKS AND ESTIMATION METHODS
APPENDIX F ABILITY VALUES FROM JOINT ESTIMATION
APPENDIX G BILOG-MG SYNTAX FILES
APPENDIX H ABILITY VALUES FROM SEPARATE ESTIMATION

REFERENCES

LIST OF TABLES

3.1 Statistics for IRT Domain Scores from NELS
Statistics for Simulated Ability Values
Statistics for Simulated IRT Domain Scores and IRT Domain Scores from NELS
Descriptive Statistics for Estimated Ability Values
Descriptive Statistics for IRT Domain Score Values
Descriptive Statistics for Growth Score Values
AAD for IRT Domain Scores
Average Absolute Deviation for IRT Growth Scores
Horizontal Distances at Selected Percentiles
Effect Sizes
Horizontal Distances for IRT Domain Scores Derived from Estimated Item Parameters Using EAP Estimation in the MST Framework
A.1 Item Parameters Used to Generate Response Data
A.2 Summary of Item Parameters in Reading Item Pool
A.3 Summary of Item Parameters and Information by Administration

LIST OF FIGURES

1.1 Representation of Single Form Testing Sequence
Representation of Adaptive Multistage Testing Sequence
Three Item Characteristic Curves
A.1 MST Base Year Test Information Curve
A.2 MST First Follow-up Test Information Curves
A.3 MST Second Follow-up Test Information Curves
A.4 SFT First, Second, and Third Test Information Curves
B.1 Known Ability Distributions at Each Testing Occasion
B.2 Known IRT True Scores at Each Testing Occasion
B.3 True Score Distributions from Observed NELS Data at Each Testing Occasion
B.4 MST EAP Ability Distributions at Each Testing Occasion
B.5 MST MLE Ability Distributions at Each Testing Occasion
B.6 SFT EAP Ability Distributions at Each Testing Occasion
B.7 SFT MLE Ability Distributions at Each Testing Occasion
B.8 MST EAP True Score Distributions at Each Testing Occasion
B.9 MST MLE True Score Distributions at Each Testing Occasion
B.10 SFT EAP True Score Distributions at Each Testing Occasion
B.11 SFT MLE True Score Distributions at Each Testing Occasion
C.1 NELS Growth Score Distributions
C.2 Simulated Growth Score Distributions
C.3 MST EAP Growth Score Distributions
C.4 MST MLE Growth Score Distributions
C.5 SFT EAP Growth Score Distributions
C.6 SFT MLE Growth Score Distributions
D.1 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.2 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.3 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.4 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.5 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.6 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.7 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.8 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.9 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.10 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.11 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.12 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
E.1 Differences in the Estimated First Horizontal Distance (ĤD) from Known Horizontal Distances (HD) for Each Testing Method and Statistical Estimator
E.2 Differences in the Estimated Second Horizontal Distance (ĤD) from Known Horizontal Distances (HD) for Each Testing Method and Statistical Estimator
F.1 MST EAP Ability Distributions at Each Testing Occasion Based on Simultaneous Estimation of Ability and Item Parameters
H.1 MST EAP Ability Distributions at Each Testing Occasion Based on Separate Estimation of Ability and Item Parameters
H.2 SFT EAP Ability Distributions at Each Testing Occasion Based on Separate Estimation of Ability and Item Parameters

CHAPTER 1 INTRODUCTION

The desire of educational policymakers in general and educational practitioners in particular to follow students' development is a strong one (Seltzer et al., 1994). Many educational policies in effect today are premised on the belief that a score can be placed on a metric that facilitates inferences as to how much and how well a student understands content within a particular domain. Indeed, the requirement imposed by the No Child Left Behind Act of 2001 (NCLB) that students be tested in several grades in several subjects contributes to the notion that academic progress can and should be measured. However, the development of useful educational information depends upon a variety of factors, including the desired method of testing, the inferences to be made from the testing results, and the accountability constraints that must be accommodated. Two issues that relate directly to measuring and reporting student progress through the educational system are adaptive testing and vertical scaling. Both offer potential benefits to test developers and test score users. When used together, they may provide tractable solutions to complex testing problems.

1.1 Adaptive Testing

Adaptive testing is a process that seeks to provide an efficient estimate of a test taker's ability using fewer items, with similar reliability, when compared to a traditional paper and pencil test. Adaptive testing can take place at the item level, the testlet (group of items) level, and at the level of a complete test form.

Item level adaptation is usually carried out through computerized adaptive testing (CAT) and has consistently been found to provide the lowest standard error of measurement of any testing mode (Weiss, 1982; Reese and Schnipke, 1999; Schnipke and Reese, 1999). Testlet level adaptation can be carried out using CAT or paper and pencil multistage tests. Complete test form adaptation has historically been carried out through out-of-level testing. Typically done to match the ability of the test taker, it is usually administered on an individual basis at the discretion of teachers or other educational professionals (Plake and Hoover, 1979).

1.2 Vertical Scaling

Vertical scaling seeks to provide a common metric between two tests that are similar in content but different with respect to overall difficulty. The process of developing a vertical scale can accommodate different data collection designs, as well as different assumptions upon which the vertical scale is based. The resulting scale is useful for answering questions that are most concerned with change in test scores. This change is usually described as growth. Although vertical scales are useful for measuring growth in academic achievement, constructing a vertical scale can be difficult. It requires the use of judgment, a knowledge of statistical techniques and their assumptions, familiarity with benefits and tradeoffs, and an acceptance of the limitations placed upon the procedures.

A typical vertical scaling would be carried out with two groups of students who differ in average ability, each group taking a test specifically designed to represent the appropriate level of difficulty. Each test is therefore appropriate for the intended group, but when the tests are compared to each other, the test for the higher ability group is more difficult than the test for the lower ability group. For example, if a vertical scale were to be constructed to describe math problem solving skills, each group would have a test assessing the appropriate content of math problem solving. There would also be items on each test that establish a link between groups. The difference between the two groups' performance on the common items would establish the nature of the math problem solving scale.

Much research has been devoted to investigating properties of vertical scales, and the issues of mean growth trajectory and variability of scores across ability levels are the most debated with respect to their influence on vertical scaling methods. This debate was particularly energetic when Thurstone, item response theory (IRT), and Hieronymus methods were compared to each other, with studies showing that the growth implied by these methods differs widely (Hoover, 1984a; Burket, 1984; Hoover, 1984b; Clemans, 1993; Yen et al., 1996b; Clemans, 1996; Yen et al., 1996a; Seltzer et al., 1994; Schulz and Nicewander, 1997; Williams et al., 1998). Some of the studies carried out different scalings on the same data. These studies are addressed more fully in the literature review, but it is important to recognize that differences in mean growth and score variability across test administrations will likely influence the outcome of a vertical scale.
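To make the common-item link described above concrete, the sketch below shows one traditional way of estimating a linear transformation between two IRT scales, the mean/sigma method, using hypothetical anchor-item difficulties. The values, the function name, and the choice of the lower-level group as the base scale are illustrative assumptions only; they are not the linking procedure used in this study.

```python
import numpy as np

def mean_sigma_linking(b_base, b_new):
    """Mean/sigma linking constants from common-item difficulty estimates.

    b_base: anchor-item difficulties calibrated with the base (lower-level) group.
    b_new:  the same items' difficulties calibrated with the other group.
    Returns (A, B) such that theta_on_base_scale = A * theta_new + B.
    """
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

# Hypothetical difficulties for five common items from two separate calibrations
b_lower_group = np.array([-0.60, -0.20, 0.10, 0.50, 0.90])
b_upper_group = np.array([-1.40, -1.00, -0.60, -0.30, 0.20])

A, B = mean_sigma_linking(b_lower_group, b_upper_group)
print(f"A = {A:.3f}, B = {B:.3f}")  # slope and intercept of the linking line
```

Under the same transformation, item difficulties on the new scale would be rescaled as A*b + B and discriminations as a/A, which is what places both groups' parameters on a single developmental metric.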

Figure 1.1: Representation of Single Form Testing Sequence

1.3 Vertical Scaling with Adaptive Tests

A traditional procedure for establishing a vertical scale would use a single test form for each of several different groups (grades, for example). The difficulty of the forms would progressively increase across grades, such that each form is designed to be appropriate for the level in which it is administered. Vertical scaling establishes a link between these forms such that scores from any of the forms have direct meaning with respect to the domain of interest. A representation of this testing sequence can be seen in Figure 1.1. Vertically scaling these types of forms could be done using item response theory (IRT) or Thurstone methods.

Adding to the complexity of developing a vertical scale is the use of different test forms within a test administration that are tailored to the ability of the test taker. The additional complexity arises when two (or possibly more) forms are used within a particular test administration, where each form is designed to measure the same construct but differs in overall difficulty. Thus, not only are forms different in difficulty between administrations, but the multiple forms within an administration are also different in difficulty.

Figure 1.2: Representation of Adaptive Multistage Testing Sequence

A specific framework of IRT applied to the processes of adaptive multistage testing and vertical scaling simultaneously addresses these differences in test form difficulty and group ability. This process involves the repeated assessment of a cohort of students over a number of years, with follow-up tests tailored to prior estimates of ability. A representation of this testing sequence can be seen in Figure 1.2. Combining the desirable aspects of test adaptability and commonality of scale has the potential to provide defensible solutions to complex testing problems. Following the illustration in Figure 1.2, the adaptive framework would result in different groups of students taking different tests at time 2. Assignment to the easy (E1) or difficult (D1) form would depend on the result of the earlier administered base-year form (B). If students scored above some critical value of θ1, these more able students would receive the difficult test. The less able students, scoring below the critical value of θ1, would receive the easier test. If another testing occasion were to occur, assignment to the form used at time 3 would similarly be based on the score a student received at time 2 (θ2).
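A minimal sketch of the routing logic illustrated in Figure 1.2 follows. The cutoff values, form labels, and the use of simple point estimates of θ for routing are illustrative assumptions, not the operational rules of any particular testing program.

```python
def route_next_form(theta_hat, cutoff):
    """Assign the next-stage form from the current ability estimate.

    theta_hat: ability estimate from the form just taken (e.g., theta_1 after form B).
    cutoff:    routing threshold on the theta scale (hypothetical value).
    """
    return "difficult" if theta_hat >= cutoff else "easy"

# Hypothetical three-occasion sequence: base year, then two adaptive follow-ups
theta_1 = 0.35                                        # estimate after the base-year form B
form_time2 = route_next_form(theta_1, cutoff=0.0)     # routes to D1 or E1
theta_2 = 0.80                                        # estimate after the time-2 form
form_time3 = route_next_form(theta_2, cutoff=0.5)     # routes to D2 or E2
print(form_time2, form_time3)
```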

In order for the change in scores from time 1 to time 2 to have meaning, the scale must accommodate differences in test difficulty across administration times and within a test administration. In order to produce a meaningful vertical scale, each form at time 3, each form at time 2, and the base-year form must be on the same scale. Therefore, the ultimate goal is to have a scale that will relate the ability values (θ1, θ2, and θ3) in the same way regardless of which forms a particular student takes. Vertical scaling in an adaptive multistage framework seeks to place the scores from these tests on the same scale, and developing a vertical scale that simultaneously accommodates scores from all of the forms on a common metric is the focus of interest here.

Several national studies have used a longitudinal, multistage testing framework similar to the one described above. Three notable studies are the National Education Longitudinal Study of 1988 (NELS:88), the Education Longitudinal Study of 2002 (ELS:2002), and the Early Childhood Longitudinal Study (ECLS). Each of these studies provides a wealth of information that serves as the foundation for research in multiple academic disciplines, including sociology, psychology, education, educational policy, and policy evaluation. Included in the data is test score information based upon several testing occasions. Adaptive multistage tests have been used in all of these data collections. As an example, the NELS:88 base-year test for Reading was the same for all students. Based upon performance on the base-year exam, students received either an easy or a difficult form for the first follow-up. Similarly, based upon performance on the first follow-up, students received either an easy or a hard form for the second follow-up. For Mathematics, three forms (easy, medium, and hard) were used. Estimates of student proficiency were provided for three points in time (specifically, for the years 1988, 1990, and 1992). It is these scores that are of interest to evaluators and researchers who want to measure change.

The scores recommended by NCES for investigations into change or growth are called IRT estimated number right scores. These scores are the number of items a student would be likely to answer correctly if he or she had taken every item in the calibrated item pool. This value is calculated by applying an estimate of ability to each item, calculating the probability of answering the item correctly (based on an IRT model), and summing these probability values across items. These values have been called IRT true scores (Lord, 1980; Yen and Fitzpatrick, 2006) and IRT domain scores (Bock, 1997; Bock et al., 1997). This metric is useful because it is on the number correct scale (even though the values are fractional), and change scores can be described in terms of an increase in expected number right. For this study, the term domain score will be used. This term is appropriate because the development of a particular student is anchored to the particular set of items that comprise the domain, and changes in scores are based on the same sample of items in the domain.
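A small sketch of this calculation is given below. It assumes a three-parameter logistic (3PL) response model, the scaling constant D = 1.7, and made-up item parameter arrays; all of these are assumptions for illustration, not the calibrated NELS item pool.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response to each item at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def domain_score(theta, a, b, c):
    """IRT true (domain) score: expected number correct over the calibrated pool."""
    return float(np.sum(p_correct_3pl(theta, a, b, c)))

# Hypothetical calibrated pool of four items
a = np.array([1.2, 0.8, 1.5, 1.0])       # discrimination
b = np.array([-0.5, 0.0, 0.7, 1.2])      # difficulty
c = np.array([0.20, 0.25, 0.20, 0.15])   # pseudo-guessing

tau_time1 = domain_score(-0.2, a, b, c)
tau_time2 = domain_score(0.4, a, b, c)
growth = tau_time2 - tau_time1           # growth on the expected-number-correct metric
```

Growth between occasions is then simply the difference of two domain scores computed from the same item pool, which is what anchors the growth interpretation used throughout this study.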

The workability of vertical scaling in an adaptive multistage testing framework is heavily dependent on the assumptions of item response theory. The framework capitalizes on two specific benefits of IRT: the item-free property of ability parameters and the person-free property of item parameters. The item-free property states that a person's ability is independent of the particular sample of items used. Thus, if one set of thirty items were chosen from a calibrated item pool, and a second set of thirty items were chosen from the same pool, the estimates of ability from the two tests should differ only by errors of measurement if the test taker were to take both forms (Lord, 1980; Harris and Hoover, 1987; Hambleton, 1989; Hambleton et al., 1991). Likewise, the statistical characteristics of items are independent of the particular sample drawn from the population used to calibrate the items. This person-free property means that if a set of thirty items were administered to one sample of test takers, and the same set of thirty items were administered to another set of similar test takers, the resulting item parameter estimates for the same items would be expected to be linearly related (Lord, 1980). These two properties are collectively known as the invariance properties of IRT. Harris and Hoover (1987) note that if these properties held, they would essentially solve the problem of vertical [scaling].

If these invariance properties hold, the logical question may be asked as to why any type of adaptation would be necessary. If the location and scale of two different ability estimates have been linked, why is it beneficial to give different groups of people different sets of items? The answer lies in the precision of the ability estimate. The precision that different sets of items provide will differ over the range of ability. The purpose of adapting a test to a previous estimate of ability is to minimize this error of measurement. This is especially important when accurate estimation is desired at the upper and lower ends of the ability range.
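A small simulation can make the item-free property concrete: responses are generated from a single calibrated pool, and ability is then estimated separately from two disjoint half-pools. This is only a sketch under assumed 3PL item parameters and a simple grid-search maximum likelihood estimator; it does not reflect the estimation procedures used later in this study.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def mle_theta(resp, a, b, c, grid=np.linspace(-4, 4, 161)):
    """Grid-search maximum likelihood estimate of ability for one response vector."""
    p = p_3pl(grid[:, None], a, b, c)            # (grid points x items)
    loglik = (resp * np.log(p) + (1 - resp) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Hypothetical calibrated pool of 60 items, split into two disjoint 30-item sets
a = rng.uniform(0.8, 2.0, 60)
b = rng.normal(0.0, 1.0, 60)
c = np.full(60, 0.2)
theta_true = 0.5
resp = rng.binomial(1, p_3pl(theta_true, a, b, c))

est_set1 = mle_theta(resp[:30], a[:30], b[:30], c[:30])
est_set2 = mle_theta(resp[30:], a[30:], b[30:], c[30:])
print(est_set1, est_set2)  # the two estimates should differ only by measurement error
```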

1.4 Purpose of the Study

Vertical scaling in an adaptive multistage framework seeks to place scores from tests that are intentionally different in difficulty, and taken by groups that are intentionally different in ability, on the same scale. The reasons for investigating this process are more than academic; real consequences may be attached to vertical scaling results. It is imperative that when evaluative decisions are made about program efficacy, school funding, or policy directives based upon student test scores, those scores be provided in the most defensible manner possible. This may be especially true with respect to high-stakes decisions regarding certification or licensing. Combining the desirable aspects of vertical scaling with adaptive testing should provide scores that are accurate, defensible, and appropriate.

The focus of this dissertation is to investigate the effect of using an adaptive multistage testing framework with item response theory scaling techniques to establish a vertical scale. The issue of importance is whether vertical scaling using an adaptive multistage testing framework influences the accuracy of measuring ability and ability growth. It is important to understand the effect of using an adaptive multistage testing framework when vertically scaling achievement test data because it seems entirely plausible that using forms which are intentionally different in difficulty may influence the nature of the vertical scale, and questions of appropriate interpretation of the data will likely arise.

Large-scale data collections, like those mentioned earlier, provide educational data for research purposes that will be used in a vast array of disciplines. It is likely that statements about the effectiveness of interventions, programs, or other issues surrounding the educational performance of students will be made on the basis of these data. Whether the adaptive multistage framework might influence any statement or conclusion about the effectiveness of schools, teachers, or even principals is unknown. Currently, no research exists that investigates the use of IRT in a longitudinal data collection design that incorporates an adaptive multistage testing framework in a vertical scaling context. This study seeks to fill this gap in the literature. Multiple studies have addressed the use of IRT when constructing a vertical scale, but none has systematically investigated the possible benefits of an adaptive multistage testing framework. The literature demonstrates that the desire to follow students' progress is likely to persist, and the use of IRT procedures is likely to continue. Additionally, the desire to accommodate a test to the ability of the test taker seems reasonable, so long as scores from the different forms can be placed in relation to each other in a defensible manner. Using an adaptive multistage testing framework may provide a tractable solution to competing demands on testing technology: accuracy of ability estimation, comparability of scale scores across a developmental range, and effectively accommodating testing to the ability of the test taker.

Research Questions

Specific research questions were formulated such that each question would address a specific issue in using IRT to develop a vertical scale in an adaptive multistage testing framework. The research questions for this study are:

1. Does the use of an adaptive multistage testing framework influence the recovery of IRT domain score values?
2. Does the use of different statistical estimators influence the recovery of IRT domain score values?
3. Does the use of an adaptive multistage testing framework influence the recovery of IRT true growth score values?
4. Does the use of different statistical estimators influence the recovery of IRT true growth score values?
5. Do differences in testing frameworks and statistical estimators influence growth across the range of score distributions?

CHAPTER 2 LITERATURE REVIEW

This chapter provides an introduction to several issues related to adaptive testing and vertical scaling. The first section addresses general aspects of adaptive testing, including adaptive testing situations that are designed for longitudinal analysis. The next section includes a discussion of vertical scaling. Item response theory is discussed next. Following this is a discussion of methods that establish a common IRT metric for different test administrations. Lastly, a review of studies investigating vertical scaling using item response theory is presented.

2.1 Adaptive Testing

Adaptive testing, in its purest sense, is not a new idea. The desire to accurately gauge the level of performance of a student, without overburdening either the test taker or the administrator, has been around since tests were first given (Wainer and Kiely, 1987; Wainer et al., 2000). One example of early adaptive testing is Binet's intelligence test. Weiss (1985; 1982) notes that Binet's test possessed the four general rudiments of an adaptive testing framework:

1. The starting point on the test was variable (based on a previous estimate of ability).
2. Items were scored as they were administered.
3. A following item was chosen based on the score on the preceding item.
4. Testing was stopped based on a predefined termination rule.

Unfortunately, individualized testing is not a feasible option for implementing efficient large-scale procedures. However, it seems that developments in testing technology can provide an approximation to the ultimate goal of individualized assessment: targeting questions to the ability of the test taker and reliably measuring that ability to some pre-specified degree of precision. Three general types of adaptation are introduced here: computerized adaptive testing (CAT), multistage testing (MST), and complete or whole test form adaptation. Each has benefits over traditional paper and pencil (P&P) testing, and burdens that P&P testing does not have. Most comparisons of CAT are made against a similarly situated P&P test (same content, same population of test takers, etc.). These traditional tests are usually described as linear tests. To be effective in its purpose of measuring a large group of heterogeneous test takers accurately, a linear P&P test is usually designed to cover a broad range of ability. The test design generally includes a few difficult items for more able students, a few easy items for less able students, and a large majority of items designed to measure the particular trait well at the middle range of ability (Wainer et al., 2000). Although this framework works well on average, it may have consequences that are unintentional but nonetheless potentially influential for test takers at either extreme of the ability range. Less able test takers may become frustrated, demoralized, and possibly hostile toward the testing situation. Highly able students may become bored, distracted, and possibly dismissive of the testing situation.

Neither of these results is desirable, and adaptive testing procedures might offer superior alternatives.

Complete Test Form Adaptation

Out-of-level testing (also called functional testing or off-level testing) is a process by which a student is given a test form that is different from the one given to the majority of other students in his or her grade. The purpose of this practice is to make the relationship between the student's ability and the difficulty of the test less disparate. This can mean giving a more able student a higher level of the test, or a less able student a lower level of the test. In essence, it is believed that the resulting score from an off-level test would be more accurate "... since [students] are no longer guessing or answering carelessly" (Minnema et al., 2000). If scaling procedures have been carried out correctly, the overall impact of allowing a student to test out-of-level should be minimal. Although administering a test out-of-level is quite easy, the substantive meaning and interpretations garnered from an off-level test score are not as easily understood.

Ayrer and McNamara (1973) studied the effects of off-level testing in a Philadelphia school district. They noted that as more students took an off-level test, the mean grade equivalent (GE) for that particular grade was lowered by about .3 over the course of two years. Long, Shaffran, and Kellog (1977) administered the on-grade and off-grade levels of the Gates-MacGinitie Vocabulary and Comprehension subtests to a sample of students who were one or more grade levels behind in reading based on scores from another test of vocabulary.

Students were assigned an on-level test and an off-level test based upon scores on the Botel Word Opposites Test. The order in which the on-level and off-level tests were administered was counterbalanced to avoid introducing any order effect. In this study, students in grade 2 could take a test that was one grade behind; students in grade 3 could take forms designed for grades one or two; and students in grade 4 could take forms designed for grades one, two, or three. Results indicated that for grades 2 and 3, the off-level tests consistently produced higher GE scores. For grade 4, the off-level tests produced systematically lower GE scores. These results are problematic, especially with respect to test score reporting for accountability purposes. If an off-level administration of a test is designed to provide a more accurate estimate of ability or achievement, the resulting scores should not have been as disparate. These results also indicate that teachers, policymakers, and other stakeholders would seek from the measurement profession a manner of scaling that would produce scores which could more easily be compared across grades, especially when issues of accountability and funding are considered.

Multistage Testing

The decision to use multistage testing involves weighing the relative burdens and benefits that are pertinent to a particular testing situation. A multistage testing framework offers some of the major benefits of CAT, especially when compared to linear tests.

Multistage testing can take place as a CAT, or it can take place using paper and pencil as the mode of administration. The main distinction between a multistage test (MST) and a traditional CAT is that adaptation of the testing environment to the ability of the test taker does not occur at the item level. Adaptation occurs at the testlet level, which offers distinct advantages over traditional linear tests, and some advantages and drawbacks when compared to a traditional item level CAT. When compared to an item level CAT, the MST offers the advantages of greater control over test construction (balancing of content domains), a more plausible assumption of item independence (between testlets), increased control of item ordering, allowance for item review within a testlet, and fewer data management demands (Hendrickson, 2007). However, there are some disadvantages of an MST when compared to an item level CAT. One disadvantage is that more items are generally needed to reach the same level of precision. The use of testlets also has a bearing on the test development process. A testlet must be developed as a whole, and this may be a greater burden with respect to item development. Also, if a particular item within a testlet begins to function poorly, there is no known procedure by which it could be replaced while retaining the original functionality of the testlet.

Computerized Adaptive Testing

Computerized adaptive testing necessarily requires the development of computer software and compatible hardware, but there are several other aspects that must be considered before a fully operational system is in working order. Weiss and Kingsbury (1984) note several components:

1. Item response model: Parametric and non-parametric models are available for use. The important issue is to make sure that the model chosen is the most appropriate for the data available.

2. Item pool: The number of items made available to the test taker must be large enough to avoid item overexposure, and varied enough to allow precise measurement throughout the range of ability. The items must also be calibrated so that they are on the same scale.

3. Entry level: The location along the item difficulty continuum at which a CAT session begins can be changed. The first item given to a test taker is usually set around the middle of the difficulty range, with subsequent items adapting to the particular pattern of responses made by the test taker. The difficulty of the item at the starting point generally does not have a large impact on a person's score.

4. Item selection rule: Along with adjusting for difficulty and the pattern of responses of the test taker, issues of process and content balance may enter into item selection. An item may enter into a testing sequence even if it is not the most statistically optimal, because it covers a domain that has not yet been satisfactorily measured. Two general statistical procedures are used to sequentially select items: the first is the maximum information function (when MLE is used), and the second is the minimum posterior variance (when Bayesian estimation is used).

Because both procedures are related to the information function, the items selected by the two procedures are usually quite similar (a small sketch of information-based selection follows this list).

5. Scoring method: Two general methods are used to score responses in a CAT: maximum likelihood and Bayesian. Bayesian estimation can generally accommodate response patterns that would be troublesome under maximum likelihood scoring. However, Bayesian estimation regresses the estimated ability toward prior estimates of ability. Maximum likelihood estimation is asymptotically unbiased, producing the smallest variance among unbiased estimators. Bayesian estimation produces biased estimates of ability, but these usually have a smaller variance than MLE. Weiss and Kingsbury (1984) suggest a compromise of using both estimators to capitalize on the benefits of each.

6. Termination rule: Whether MLE or Bayesian methods are used, the decision to end the testing session must be governed by some criterion. Some tests are fixed length, where a pre-specified number of items is given. Other tests stop after the error of measurement has become sufficiently small to provide a reasonable point estimate and a narrow range for the resulting test score. Some criteria may be specified with respect to a cut score, with testing stopping once a person's score is sufficiently above or below the cut score to declare him or her a master or non-master.
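The sketch below illustrates the maximum-information selection rule named in component 4, using the standard 3PL item information function. The item parameter values and the D = 1.7 constant are illustrative assumptions; an operational CAT would additionally enforce content-balance and exposure constraints.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of 3PL items at ability theta."""
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, a, b, c, administered):
    """Pick the unadministered item with maximum information at the current estimate."""
    info = info_3pl(theta_hat, a, b, c)
    info[list(administered)] = -np.inf   # exclude items already given
    return int(np.argmax(info))

# Hypothetical pool of five items and a partially completed session
a = np.array([1.4, 0.9, 1.8, 1.1, 1.6])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
c = np.full(5, 0.2)
next_item = select_next_item(theta_hat=0.3, a=a, b=b, c=c, administered={0, 2})
print(next_item)
```

A Bayesian variant would instead pick the item that minimizes the expected posterior variance of θ, but because posterior variance is driven by the same information function, the two rules tend to choose similar items, as noted above.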

There are several benefits of CAT, although their full realization had to await developments in technology. With the advent of sophisticated desktop computers, those benefits became a reality. The major benefits of computerized adaptive testing are:

Efficiency: Adaptive testing generally produces a test that is at least as reliable as a linear test with considerably fewer items being administered.

Precision: Adaptive tests generally provide greater precision (i.e., less measurement error) over a larger range of the ability continuum when compared to a linear test.

Appropriateness: Adaptive testing tries to align the difficulty of the items chosen to the best estimate of the ability of a test taker. The issue of having to answer very hard and very easy items is avoided.

Immediacy: Test scores are immediately available to a test taker.

Customizability: Given particular testing constraints, tests can be given on a more flexible schedule; tests can proceed at a test taker's own pace; items that seem to be troublesome can be removed; and novel item formats can be used.

2.2 Adaptive Multistage Testing

Most implementations of adaptive testing are concerned with accurate measurement of a particular trait or ability at one particular point in time. This is true whether the mode of administration is adaptive at the item level, testlet level, or multistage level.

This is to be expected, especially given the purposes for which adaptive testing has been applied (e.g., admissions, certification, etc.). Most studies of multistage testing also apply the process of ability estimation to one particular point in time. However, the methods that are used to estimate ability for one administration may be successfully applied to monitoring change in ability over time.

One early study of multistage testing was carried out by Linn, Rock, and Cleary (1969) using the SCAT and STEP tests. Although the procedures were not based on IRT methods, the information used to develop the two-stage forms approximated the idea of producing a reliable estimate of ability with fewer items. The criterion was an index relating the number of items a conventional test would need in order to match the validity coefficient (between these tests and the PSAT Verbal and Math tests) to the number of items on the two-stage test. The authors found that this value was 3.36 based on separate group regressions, and 2.33 for a regression based on a common slope. This study indicates that even when items were grouped based upon classical item level statistics, tailoring items to the ability of the test taker was beneficial.

Lord (1971) investigated the application of different adaptation procedures in adaptive testing. He noted that this application was a break from earlier two-stage testing used in personnel decision making, where borderline examinees took a second exam that would facilitate a better classification decision of selection or rejection. In the new application, the second level of testing would apply to all test takers, and the added utility is more precise measurement based upon current estimates of ability. The main results from this study demonstrated that different adaptive methods provide greater information at different points along the ability range.

As Lord expected, the adaptive procedure produced less error variance at the high and low ends of the ability range.

Bock and Zimowski (1998) carried out a feasibility study of two-stage testing as it might be implemented in the National Assessment of Educational Progress (NAEP). A test of science in four areas (earth science, biology, chemistry, and physics) was given to a sample of secondary students in Ohio. Tests were assembled from science items received in response to an earlier request for science-based items sent to state, provincial, and national testing programs. The first-stage test was administered in January or February of 1991, and the second-stage test was administered in April or May. Assignment to the second-stage form was based upon the subdomain number correct score, the number of courses taken in the subdomain, and the number correct score for the total test. Random (presumably parallel) forms were constructed within each difficulty level for each domain, and assignment to a particular form within a difficulty level was random. The order of presenting the domains was spiraled across tests. Because this was designed to assess the feasibility of NAEP procedures, not every student took every item. The links between forms were not clearly articulated, but it was noted that items were shared across forms.

This study introduced several complex issues. The first issue that may influence ability estimation is whether learning had occurred between the test administrations. It is likely that this was the case, but the results did not mention any aspect of growth or change in science achievement as the focus of the study.

Also, assignment to forms was done using a set of complex decision rules as opposed to using an estimate of ability only. It may be that using characteristics other than an ability estimate to assign forms provided more appropriate assignments, but the use of characteristics external to the test in an operational testing situation, such as certification or accountability, would likely not be acceptable. Finally, the focus of this study was geared toward possible improvements in the data collection procedures used in NAEP. The authors noted that the inclusion of an adaptive testing component, combined with longitudinal data collection, could provide the ability to develop validity evidence for the NAEP scale.

Two studies (Schnipke and Reese, 1999; Reese and Schnipke, 1999) addressed the use of testlets in computer adaptive testing. Both studies addressed the use of CAT and testlets to estimate ability at one particular point in time. Both indicated that the two-stage design produced less error across the range of ability than did a conventional paper and pencil test. A CAT that used testlets performed only slightly worse than a traditional item-level CAT. These results indicate that the use of testlets can approximate the high degree of precision attained by item-level test adaptation. The tradeoff between the relatively small loss of precision and the greater control over content balancing and ease of test administration would have to be weighed based on the purposes of testing. For large-scale, relatively low-stakes assessments, the combined strengths of paper and pencil administration and adaptive forms would be beneficial. One study examined an explicit application of computerized adaptive testing


More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Advanced Dental Admission Test (ADAT) Official Results: 2017

Advanced Dental Admission Test (ADAT) Official Results: 2017 Advanced Dental Admission Test (ADAT) Official Results: 2017 Normative Period: April 3 through August 31, 2017 Number of ADAT Candidates: 481 Number of ADAT Administrations: 483 Report Date: September

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN By Moatasim A. Barri B.S., King Abdul Aziz University M.S.Ed., The University of Kansas Ph.D.,

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

Supplementary Material*

Supplementary Material* Supplementary Material* Lipner RS, Brossman BG, Samonte KM, Durning SJ. Effect of Access to an Electronic Medical Resource on Performance Characteristics of a Certification Examination. A Randomized Controlled

More information

ANXIETY A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Pediatric Bank v1.0 Anxiety PROMIS Pediatric Short Form v1.0 - Anxiety 8a PROMIS Item Bank v1.0 Anxiety PROMIS

More information

New Mexico TEAM Professional Development Module: Deaf-blindness

New Mexico TEAM Professional Development Module: Deaf-blindness [Slide 1] Welcome Welcome to the New Mexico TEAM technical assistance module on making eligibility determinations under the category of deaf-blindness. This module will review the guidance of the NM TEAM

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

FATIGUE. A brief guide to the PROMIS Fatigue instruments:

FATIGUE. A brief guide to the PROMIS Fatigue instruments: FATIGUE A brief guide to the PROMIS Fatigue instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS Ca Bank v1.0 Fatigue PROMIS Pediatric Bank v2.0 Fatigue PROMIS Pediatric Bank v1.0 Fatigue* PROMIS

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan

More information

SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM

SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM 505-3-.98 SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM To Become Effective June 15, 2016 Nature of Amendment(s): Substantive Clarification Further Discussion: It is proposed that GaPSC Rule 505-3-.98

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Chapter 4 Research Methodology

Chapter 4 Research Methodology Chapter 4 Research Methodology 137 RESEARCH METHODOLOGY Research Gap Having done a thorough literature review on gender diversity practices in IT organisations, it has been observed that there exists a

More information

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Hossein Barati Department of English, Faculty of Foreign Languages, University of Isfahan barati@yahoo.com Zohreh Kashkoul*

More information

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES Smoking Negative Psychosocial Expectancies A brief guide to the PROMIS Smoking Negative Psychosocial Expectancies instruments: ADULT PROMIS Item Bank v1.0 Smoking Negative Psychosocial Expectancies for

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference*

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference* PROMIS Item Bank v1.1 Pain Interference PROMIS Item Bank v1.0 Pain Interference* PROMIS Short Form v1.0 Pain Interference 4a PROMIS Short Form v1.0 Pain Interference 6a PROMIS Short Form v1.0 Pain Interference

More information

Smoking Social Motivations

Smoking Social Motivations Smoking Social Motivations A brief guide to the PROMIS Smoking Social Motivations instruments: ADULT PROMIS Item Bank v1.0 Smoking Social Motivations for All Smokers PROMIS Item Bank v1.0 Smoking Social

More information

A Comparison of Four Test Equating Methods

A Comparison of Four Test Equating Methods A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS

AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS Attached are the performance reports and analyses for participants from your surgery program on the

More information

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance Lauren Byrne, Melannie Tate Faculty Sponsor: Bianca Basten, Department of Psychology ABSTRACT Psychological research

More information

Chapter-2 RESEARCH DESIGN

Chapter-2 RESEARCH DESIGN Chapter-2 RESEARCH DESIGN 33 2.1 Introduction to Research Methodology: The general meaning of research is the search for knowledge. Research is also defined as a careful investigation or inquiry, especially

More information

2016 Technical Report National Board Dental Hygiene Examination

2016 Technical Report National Board Dental Hygiene Examination 2016 Technical Report National Board Dental Hygiene Examination 2017 Joint Commission on National Dental Examinations All rights reserved. 211 East Chicago Avenue Chicago, Illinois 60611-2637 800.232.1694

More information

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM

More information

INTRODUCTION TO ASSESSMENT OPTIONS

INTRODUCTION TO ASSESSMENT OPTIONS DEPRESSION A brief guide to the PROMIS Depression instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.0 Depression PROMIS Pediatric Item Bank v2.0 Depressive Symptoms PROMIS Pediatric

More information

CHAPTER V. Summary and Recommendations. policies, including uniforms (Behling, 1994). The purpose of this study was to

CHAPTER V. Summary and Recommendations. policies, including uniforms (Behling, 1994). The purpose of this study was to HAPTER V Summary and Recommendations The current belief that fashionable clothing worn to school by students influences their attitude and behavior is the major impetus behind the adoption of stricter

More information

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT by Debra White Moore B.M.Ed., University of North Carolina, Greensboro, 1989 M.A., University of Pittsburgh,

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

Models in Educational Measurement

Models in Educational Measurement Models in Educational Measurement Jan-Eric Gustafsson Department of Education and Special Education University of Gothenburg Background Measurement in education and psychology has increasingly come to

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Using response time data to inform the coding of omitted responses

Using response time data to inform the coding of omitted responses Psychological Test and Assessment Modeling, Volume 58, 2016 (4), 671-701 Using response time data to inform the coding of omitted responses Jonathan P. Weeks 1, Matthias von Davier & Kentaro Yamamoto Abstract

More information

The power of positive thinking: the effects of selfesteem, explanatory style, and trait hope on emotional wellbeing

The power of positive thinking: the effects of selfesteem, explanatory style, and trait hope on emotional wellbeing University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2009 The power of positive thinking: the effects of selfesteem,

More information

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

Estimating the number of components with defects post-release that showed no defects in testing

Estimating the number of components with defects post-release that showed no defects in testing SOFTWARE TESTING, VERIFICATION AND RELIABILITY Softw. Test. Verif. Reliab. 2002; 12:93 122 (DOI: 10.1002/stvr.235) Estimating the number of components with defects post-release that showed no defects in

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Public Health Masters (MPH) Competencies and Coursework by Major

Public Health Masters (MPH) Competencies and Coursework by Major I. Master of Science of Public Health A. Core Competencies B. Major Specific Competencies i. Professional Health Education ii. iii. iv. Family Activity Physical Activity Behavioral, Social, and Community

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION (Effective for assurance reports dated on or after January 1,

More information

UvA-DARE (Digital Academic Repository)

UvA-DARE (Digital Academic Repository) UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information