An Investigation of Vertical Scaling with Item Response Theory Using a Multistage Testing Framework

University of Iowa, Iowa Research Online: Theses and Dissertations, 2008

An Investigation of Vertical Scaling with Item Response Theory Using a Multistage Testing Framework

Jonathan James Beard, University of Iowa

Copyright 2008 Jonathan James Beard. This dissertation is available at Iowa Research Online.

Recommended Citation: Beard, Jonathan James. "An Investigation of vertical scaling with item response theory using a multistage testing framework." PhD (Doctor of Philosophy) thesis, University of Iowa, 2008. Part of the Educational Psychology Commons.

AN INVESTIGATION OF VERTICAL SCALING WITH ITEM RESPONSE THEORY USING A MULTISTAGE TESTING FRAMEWORK

by Jonathan James Beard

An Abstract of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa, December 2008.

Thesis Supervisor: Associate Professor Timothy Ansley

ABSTRACT

A simulation study was carried out to assess the effects of using different testing frameworks and different statistical estimators in constructing a vertical scale. The adaptive multistage testing framework (MST) consisted of five test forms administered across three testing occasions. The single form testing framework (SFT) consisted of one form at each of the three testing occasions. Maximum likelihood estimation (MLE) and Bayesian expected a posteriori (EAP) estimators were used to estimate each simulee's ability at the three testing occasions. Item response theory (IRT) true scores, or domain scores, were used as the score scale. This was done to facilitate the use of growth scores between testing occasions. It was hypothesized that testing framework and estimation procedure would influence the recovery of the known domain score for each simulee across the three testing occasions, as well as the growth values between testing occasions. Average absolute deviation (AAD) values indicated that the MST framework offered a slight reduction in error when compared to the SFT framework in estimating IRT domain scores. The pattern of errors in estimation indicated that the MST framework provided more accurate estimates across the range of ability. The MST framework also offered a slight reduction in error when estimating IRT growth scores. Horizontal distances between test administrations indicated that EAP estimation produced uneven departures from known horizontal distances, but MLE did not. This was true for both the SFT and MST frameworks. Also, when the distributions of IRT domain scores were considered, the MLE estimation method was more consistent with the distribution of known domain scores. Overall, the MST framework performed better than the SFT framework with respect to reduced estimation error and approximating the known IRT domain score.

Abstract Approved: Thesis Supervisor; Title and Department; Date

AN INVESTIGATION OF VERTICAL SCALING WITH ITEM RESPONSE THEORY USING A MULTISTAGE TESTING FRAMEWORK

by Jonathan James Beard

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa, December 2008.

Thesis Supervisor: Associate Professor Timothy Ansley

Copyright by JONATHAN JAMES BEARD, 2008. All Rights Reserved.

Graduate College, The University of Iowa, Iowa City, Iowa

CERTIFICATE OF APPROVAL: PH.D. THESIS

This is to certify that the Ph.D. thesis of Jonathan James Beard has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the December 2008 graduation.

Thesis Committee: Timothy Ansley (Thesis Supervisor), Stephen Dunbar, Michael Kolen, Lelia Helms, Won-Chan Lee

To James, Kate, Floyd, and Virginia

ACKNOWLEDGEMENTS

I have always had the great fortune to be surrounded by those who are more gifted and talented than myself. The time I have spent completing this study is no exception. There are many who deserve my heartfelt thanks. I first wish to thank my advisor and thesis supervisor, Dr. Tim Ansley, for his patient and wise words. Without him, this project could not have been completed. I would also like to thank the other members of my committee: Dr. Stephen Dunbar, Dr. Michael Kolen, Dr. Lelia Helms, and Dr. Won-Chan Lee. Their guidance served as a model for future academic thinking and writing. I want to thank my parents for providing me with the drive to persevere. I also wish to thank the friends and companions who have helped me along the way. I would especially like to acknowledge the help of the following people: Tom Proctor, Kyong-Hee Chon, Tawnya Knupp, Scott Wood, Michelle Mengeling, Paul Westrick, and Liz Hollingworth. A very special word of thanks is extended to Melissa Chapman and David Haynes. Above all, I would like to thank my wife, Michelle. She gives meaning to everything I do.

ABSTRACT

A simulation study was carried out to assess the effects of using different testing frameworks and different statistical estimators in constructing a vertical scale. The adaptive multistage testing framework (MST) consisted of five test forms administered across three testing occasions. The single form testing framework (SFT) consisted of one form at each of the three testing occasions. Maximum likelihood estimation (MLE) and Bayesian expected a posteriori (EAP) estimators were used to estimate each simulee's ability at the three testing occasions. Item response theory (IRT) true scores, or domain scores, were used as the score scale. This was done to facilitate the use of growth scores between testing occasions. It was hypothesized that testing framework and estimation procedure would influence the recovery of the known domain score for each simulee across the three testing occasions, as well as the growth values between testing occasions. Average absolute deviation (AAD) values indicated that the MST framework offered a slight reduction in error when compared to the SFT framework in estimating IRT domain scores. The pattern of errors in estimation indicated that the MST framework provided more accurate estimates across the range of ability. The MST framework also offered a slight reduction in error when estimating IRT growth scores. Horizontal distances between test administrations indicated that EAP estimation produced uneven departures from known horizontal distances, but MLE did not. This was true for both the SFT and MST frameworks. Also, when the distributions of IRT domain scores were considered, the MLE estimation method was more consistent with the distribution of known domain scores. Overall, the MST framework performed better than the SFT framework with respect to reduced estimation error and approximating the known IRT domain score.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
  Adaptive Testing
  Vertical Scaling
  Vertical Scaling with Adaptive Tests
  Purpose of the Study
  Research Questions

CHAPTER 2 LITERATURE REVIEW
  Adaptive Testing
  Complete Test Form Adaptation
  Multistage Testing
  Computerized Adaptive Testing
  Adaptive Multistage Testing
  Summary
  Vertical Scaling
  Data Collection
  Growth
  Item Response Theory Scaling
  Parameterization
  Response Type
  Dimensionality
  Summary
  Assumptions
  Parameter Estimation
  Effects of Different Estimators
  Defining IRT Scale Relationships
  Linear Transformations
  Characteristic Curve Transformations
  Concurrent Calibration
  Vertical Scaling Using IRT
  Summary
  NELS Testing Procedures
  IRT True Score Scaling

CHAPTER 3 METHODOLOGY
  Simulation Procedures
  Simulation of Known Abilities
  Generation of Item Responses
  Selection of Appropriate Items
  Ability Parameter Estimation
  MLE Estimation
  EAP Estimation
  Multiple Group Estimation
  BILOG-MG Procedures
  Summary
  Analysis
  Evaluation of IRT Domain Scores
  Evaluation of Growth Distance Measures
  Summary
  Research Hypotheses

CHAPTER 4 RESULTS
  Item Parameters and Simulated Values
  Test Characteristics and Item Parameters
  Simulated Ability Values
  Simulated Domain Score Values
  Estimated Values
  Estimated Ability Values
  Estimated Domain Score Values
  Estimated Growth Score Values
  Recovery of Scores
  Recovery of IRT Domain Scores
  Recovery of IRT Growth Scores
  Differences in Growth
  EAP Score Investigation
  Results Summary

CHAPTER 5 DISCUSSION
  Summary and Discussion
  IRT Domain Scores
  IRT Growth Scores
  Separation of Distributions
  Summary of Findings
  Implications of the Study
  Limitations of the Study
  Future Research
  Conclusions

APPENDIX A DESCRIPTION OF KNOWN ITEM PARAMETERS
APPENDIX B ABILITY AND TRUE SCORE DISTRIBUTIONS
APPENDIX C IRT TRUE GROWTH SCORE DISTRIBUTIONS
APPENDIX D IRT TRUE SCORE COMPARISONS
APPENDIX E DIFFERENCES IN HORIZONTAL DISTANCES ACROSS TESTING FRAMEWORKS AND ESTIMATION METHODS
APPENDIX F ABILITY VALUES FROM JOINT ESTIMATION
APPENDIX G BILOG-MG SYNTAX FILES
APPENDIX H ABILITY VALUES FROM SEPARATE ESTIMATION

REFERENCES

LIST OF TABLES

3.1 Statistics for IRT Domain Scores from NELS
Statistics for Simulated Ability Values
Statistics for Simulated IRT Domain Scores and IRT Domain Scores from NELS
Descriptive Statistics for Estimated Ability Values
Descriptive Statistics for IRT Domain Score Values
Descriptive Statistics for Growth Score Values
AAD for IRT Domain Scores
Average Absolute Deviation for IRT Growth Scores
Horizontal Distances at Selected Percentiles
Effect Sizes
Horizontal Distances for IRT Domain Scores Derived from Estimated Item Parameters Using EAP Estimation in the MST Framework
A.1 Item Parameters Used to Generate Response Data
A.2 Summary of Item Parameters in Reading Item Pool
A.3 Summary of Item Parameters and Information by Administration

LIST OF FIGURES

1.1 Representation of Single Form Testing Sequence
Representation of Adaptive Multistage Testing Sequence
Three Item Characteristic Curves
A.1 MST Base Year Test Information Curve
A.2 MST First Follow-up Test Information Curves
A.3 MST Second Follow-up Test Information Curves
A.4 SFT First, Second, and Third Test Information Curves
B.1 Known Ability Distributions at Each Testing Occasion
B.2 Known IRT True Scores at Each Testing Occasion
B.3 True Score Distributions from Observed NELS Data at Each Testing Occasion
B.4 MST EAP Ability Distributions at Each Testing Occasion
B.5 MST MLE Ability Distributions at Each Testing Occasion
B.6 SFT EAP Ability Distributions at Each Testing Occasion
B.7 SFT MLE Ability Distributions at Each Testing Occasion
B.8 MST EAP True Score Distributions at Each Testing Occasion
B.9 MST MLE True Score Distributions at Each Testing Occasion
B.10 SFT EAP True Score Distributions at Each Testing Occasion
B.11 SFT MLE True Score Distributions at Each Testing Occasion
C.1 NELS Growth Score Distributions
C.2 Simulated Growth Score Distributions
C.3 MST EAP Growth Score Distributions
C.4 MST MLE Growth Score Distributions
C.5 SFT EAP Growth Score Distributions
C.6 SFT MLE Growth Score Distributions
D.1 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.2 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.3 MST EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.4 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.5 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.6 MST MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.7 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.8 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.9 SFT EAP Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
D.10 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 1
D.11 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 2
D.12 SFT MLE Estimated IRT True Scores Compared to Known IRT True Scores at Time 3
E.1 Differences in the Estimated First Horizontal Distance (ĤD) from Known Horizontal Distances (HD) for Each Testing Method and Statistical Estimator
E.2 Differences in the Estimated Second Horizontal Distance (ĤD) from Known Horizontal Distances (HD) for Each Testing Method and Statistical Estimator
F.1 MST EAP Ability Distributions at Each Testing Occasion Based on Simultaneous Estimation of Ability and Item Parameters
H.1 MST EAP Ability Distributions at Each Testing Occasion Based on Separate Estimation of Ability and Item Parameters
H.2 SFT EAP Ability Distributions at Each Testing Occasion Based on Separate Estimation of Ability and Item Parameters

CHAPTER 1 INTRODUCTION

The desire of educational policymakers in general and educational practitioners in particular to follow students' development is a strong one (Seltzer et al., 1994). Many educational policies in effect today are premised on the belief that a score can be placed on a metric that facilitates inferences as to how much and how well a student understands content within a particular domain. Indeed, the requirement imposed by the No Child Left Behind Act of 2001 (NCLB) that students be tested in several grades in several subjects contributes to the notion that academic progress can and should be measured. However, the development of useful educational information depends upon a variety of factors, including the desired method of testing, the inferences to be made from the testing results, and the accountability constraints that must be accommodated. Two issues that relate directly to measuring and reporting student progress through the educational system are adaptive testing and vertical scaling. Both offer potential benefits to test developers and test score users. When used together, they may provide tractable solutions to complex testing problems.

1.1 Adaptive Testing

Adaptive testing is a process that seeks to provide an efficient estimate of a test taker's ability using fewer items, with similar reliability, when compared to a traditional paper and pencil test. Adaptive testing can take place at the item level, the testlet (group of items) level, and at the level of a complete test form.

Item level adaptation is usually carried out through computerized adaptive testing (CAT) and has consistently been found to provide the lowest standard error of measurement of any testing mode (Weiss, 1982; Reese and Schnipke, 1999; Schnipke and Reese, 1999). Testlet level adaptation can be carried out using CAT or paper and pencil multistage tests. Complete test form adaptation has historically been carried out through out-of-level testing. Typically done to match the ability of the test taker, it is usually administered on an individual basis at the discretion of teachers or other educational professionals (Plake and Hoover, 1979).

1.2 Vertical Scaling

Vertical scaling seeks to provide a common metric between two tests that are similar in content but different with respect to overall difficulty. The process of developing a vertical scale can accommodate different data collection designs, as well as different assumptions upon which the vertical scale is based. The resulting scale is useful for answering questions that are most concerned with change in test scores. This change is usually described as growth. Although vertical scales are useful for measuring growth in academic achievement, constructing a vertical scale can be difficult. It requires the use of judgment, a knowledge of statistical techniques and their assumptions, familiarity with benefits and tradeoffs, and an acceptance of the limitations placed upon the procedures.

A typical vertical scaling would be carried out with two groups of students who differ in average ability, each group taking a test specifically designed to represent the appropriate level of difficulty. Each test is therefore appropriate for the intended group, but when the tests are compared to each other, the test for the higher ability group is more difficult than the test for the lower ability group. For example, if a vertical scale were to be constructed to describe math problem solving skills, each group would have a test assessing the appropriate content of math problem solving. There would also be items on each test that establish a link between groups. The difference between the two groups' performance on the common items would establish the nature of the math problem solving scale.

Much research has been devoted to investigating properties of vertical scales, and the issues of mean growth trajectory and variability of scores across ability levels are the most debated with respect to their influence on vertical scaling methods. This debate was particularly energetic when Thurstone, item response theory (IRT), and Hieronymus methods were compared to each other, with studies showing that the growth implied by these methods differs widely (Hoover, 1984a; Burket, 1984; Hoover, 1984b; Clemans, 1993; Yen et al., 1996b; Clemans, 1996; Yen et al., 1996a; Seltzer et al., 1994; Schulz and Nicewander, 1997; Williams et al., 1998). Some of the studies carried out different scalings on the same data. These studies are addressed more fully in the literature review, but it is important to recognize that differences in mean growth and score variability across test administrations will likely influence the outcome of a vertical scale.
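To make the common-item link described above concrete, the sketch below shows one traditional way of estimating a linear transformation between two IRT scales, the mean/sigma method, using hypothetical anchor-item difficulties. The values, the function name, and the choice of the lower-level group as the base scale are illustrative assumptions only; they are not the linking procedure used in this study.

```python
import numpy as np

def mean_sigma_linking(b_base, b_new):
    """Mean/sigma linking constants from common-item difficulty estimates.

    b_base: anchor-item difficulties calibrated with the base (lower-level) group.
    b_new:  the same items' difficulties calibrated with the other group.
    Returns (A, B) such that theta_on_base_scale = A * theta_new + B.
    """
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

# Hypothetical difficulties for five common items from two separate calibrations
b_lower_group = np.array([-0.60, -0.20, 0.10, 0.50, 0.90])
b_upper_group = np.array([-1.40, -1.00, -0.60, -0.30, 0.20])

A, B = mean_sigma_linking(b_lower_group, b_upper_group)
print(f"A = {A:.3f}, B = {B:.3f}")  # slope and intercept of the linking line
```

Under the same transformation, item difficulties on the new scale would be rescaled as A*b + B and discriminations as a/A, which is what places both groups' parameters on a single developmental metric.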

Figure 1.1: Representation of Single Form Testing Sequence

1.3 Vertical Scaling with Adaptive Tests

A traditional procedure for establishing a vertical scale would use a single test form for each of several different groups (grades, for example). The difficulty of the forms would progressively increase across grades, such that each form is designed to be appropriate for the level in which it is administered. Vertical scaling establishes a link between these forms such that scores from any of the forms have direct meaning with respect to the domain of interest. A representation of this testing sequence can be seen in Figure 1.1. Vertically scaling these types of forms could be done using item response theory (IRT) or Thurstone methods.

Adding to the complexity of developing a vertical scale is the use of different test forms within a test administration that are tailored to the ability of the test taker. The additional complexity arises when two (or possibly more) forms are used within a particular test administration, where each form is designed to measure the same construct but differs in overall difficulty. Thus, not only are forms different in difficulty between administrations, but the multiple forms within an administration are also different in difficulty.

Figure 1.2: Representation of Adaptive Multistage Testing Sequence

A specific framework of IRT applied to the processes of adaptive multistage testing and vertical scaling simultaneously addresses these differences in test form difficulty and group ability. This process involves the repeated assessment of a cohort of students over a number of years, with follow-up tests tailored to prior estimates of ability. A representation of this testing sequence can be seen in Figure 1.2. Combining the desirable aspects of test adaptability and commonality of scale has the potential to provide defensible solutions to complex testing problems. Following the illustration in Figure 1.2, the adaptive framework would result in different groups of students taking different tests at time 2. Assignment to the easy (E1) or difficult (D1) form would depend on the result of the earlier administered base-year form (B). If students scored above some critical value of θ1, these more able students would receive the difficult test. The less able students, scoring below the critical value of θ1, would receive the easier test. If another testing occasion were to occur, assignment to the form used at time 3 would similarly be based on the score a student received at time 2 (θ2).
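A minimal sketch of the routing logic illustrated in Figure 1.2 follows. The cutoff values, form labels, and the use of simple point estimates of θ for routing are illustrative assumptions, not the operational rules of any particular testing program.

```python
def route_next_form(theta_hat, cutoff):
    """Assign the next-stage form from the current ability estimate.

    theta_hat: ability estimate from the form just taken (e.g., theta_1 after form B).
    cutoff:    routing threshold on the theta scale (hypothetical value).
    """
    return "difficult" if theta_hat >= cutoff else "easy"

# Hypothetical three-occasion sequence: base year, then two adaptive follow-ups
theta_1 = 0.35                                        # estimate after the base-year form B
form_time2 = route_next_form(theta_1, cutoff=0.0)     # routes to D1 or E1
theta_2 = 0.80                                        # estimate after the time-2 form
form_time3 = route_next_form(theta_2, cutoff=0.5)     # routes to D2 or E2
print(form_time2, form_time3)
```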

In order for the change in scores from time 1 to time 2 to have meaning, the scale must accommodate differences in test difficulty across administration times and within a test administration. In order to produce a meaningful vertical scale, each form at time 3, each form at time 2, and the base-year form must be on the same scale. Therefore, the ultimate goal is to have a scale that will relate the ability values (θ1, θ2, and θ3) in the same way regardless of which forms a particular student takes. Vertical scaling in an adaptive multistage framework seeks to place the scores from these tests on the same scale, and developing a vertical scale that simultaneously accommodates scores from all of the forms on a common metric is the focus of interest here.

Several national studies have used a longitudinal, multistage testing framework similar to the one described above. Three notable studies are the National Education Longitudinal Study of 1988 (NELS:88), the Education Longitudinal Study of 2002 (ELS:2002), and the Early Childhood Longitudinal Study (ECLS). Each of these studies provides a wealth of information that serves as the foundation for research in multiple academic disciplines, including sociology, psychology, education, educational policy, and policy evaluation. Included in the data is test score information based upon several testing occasions. Adaptive multistage tests have been used in all of these data collections. As an example, the NELS:88 base-year test for Reading was the same for all students. Based upon performance on the base-year exam, students received either an easy or a difficult form for the first follow-up. Similarly, based upon performance on the first follow-up, students received either an easy or a hard form for the second follow-up. For Mathematics, three forms (easy, medium, and hard) were used. Estimates of student proficiency were provided for three points in time (specifically, for the years 1988, 1990, and 1992). It is these scores that are of interest to evaluators and researchers who want to measure change.

The scores recommended by NCES for investigations into change or growth are called IRT estimated number right scores. These scores are the number of items a student would be likely to answer correctly if he or she had taken every item in the calibrated item pool. This value is calculated by applying an estimate of ability to each item, calculating the probability of answering the item correctly (based on an IRT model), and summing these probability values across items. These values have been called IRT true scores (Lord, 1980; Yen and Fitzpatrick, 2006) and IRT domain scores (Bock, 1997; Bock et al., 1997). This metric is useful because it is on the number correct scale (even though the values are fractional), and change scores can be described in terms of an increase in expected number right. For this study, the term domain score will be used. This term is appropriate because the development of a particular student is anchored to the particular set of items that comprise the domain, and changes in scores are based on the same sample of items in the domain.
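A small sketch of this calculation is given below. It assumes a three-parameter logistic (3PL) response model, the scaling constant D = 1.7, and made-up item parameter arrays; all of these are assumptions for illustration, not the calibrated NELS item pool.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response to each item at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def domain_score(theta, a, b, c):
    """IRT true (domain) score: expected number correct over the calibrated pool."""
    return float(np.sum(p_correct_3pl(theta, a, b, c)))

# Hypothetical calibrated pool of four items
a = np.array([1.2, 0.8, 1.5, 1.0])       # discrimination
b = np.array([-0.5, 0.0, 0.7, 1.2])      # difficulty
c = np.array([0.20, 0.25, 0.20, 0.15])   # pseudo-guessing

tau_time1 = domain_score(-0.2, a, b, c)
tau_time2 = domain_score(0.4, a, b, c)
growth = tau_time2 - tau_time1           # growth on the expected-number-correct metric
```

Growth between occasions is then simply the difference of two domain scores computed from the same item pool, which is what anchors the growth interpretation used throughout this study.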

The workability of vertical scaling in an adaptive multistage testing framework is heavily dependent on the assumptions of item response theory. The framework capitalizes on two specific benefits of IRT: the item-free property of ability parameters and the person-free property of item parameters. The item-free property states that a person's ability is independent of the particular sample of items used. Thus, if one set of thirty items were chosen from a calibrated item pool, and a second set of thirty items were chosen from the same pool, the estimates of ability from the two tests should differ only by errors of measurement if the test taker were to take both forms (Lord, 1980; Harris and Hoover, 1987; Hambleton, 1989; Hambleton et al., 1991). Likewise, the statistical characteristics of items are independent of the particular sample drawn from the population used to calibrate the items. This person-free property means that if a set of thirty items were administered to one sample of test takers, and the same set of thirty items were administered to another set of similar test takers, the resulting item parameter estimates for the same items would be expected to be linearly related (Lord, 1980). These two properties are collectively known as the invariance properties of IRT. Harris and Hoover (1987) note that if these properties held, they would essentially solve the problem of vertical [scaling].

If these invariance properties hold, the logical question may be asked as to why any type of adaptation would be necessary. If the location and scale of two different ability estimates have been linked, why is it beneficial to give different groups of people different sets of items? The answer lies in the precision of the ability estimate. The precision that different sets of items provide will differ over the range of ability. The purpose of adapting a test to a previous estimate of ability is to minimize this error of measurement. This is especially important when accurate estimation is desired at the upper and lower ends of the ability range.
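A small simulation can make the item-free property concrete: responses are generated from a single calibrated pool, and ability is then estimated separately from two disjoint half-pools. This is only a sketch under assumed 3PL item parameters and a simple grid-search maximum likelihood estimator; it does not reflect the estimation procedures used later in this study.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def mle_theta(resp, a, b, c, grid=np.linspace(-4, 4, 161)):
    """Grid-search maximum likelihood estimate of ability for one response vector."""
    p = p_3pl(grid[:, None], a, b, c)            # (grid points x items)
    loglik = (resp * np.log(p) + (1 - resp) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Hypothetical calibrated pool of 60 items, split into two disjoint 30-item sets
a = rng.uniform(0.8, 2.0, 60)
b = rng.normal(0.0, 1.0, 60)
c = np.full(60, 0.2)
theta_true = 0.5
resp = rng.binomial(1, p_3pl(theta_true, a, b, c))

est_set1 = mle_theta(resp[:30], a[:30], b[:30], c[:30])
est_set2 = mle_theta(resp[30:], a[30:], b[30:], c[30:])
print(est_set1, est_set2)  # the two estimates should differ only by measurement error
```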

1.4 Purpose of the Study

Vertical scaling in an adaptive multistage framework seeks to place scores from tests that are intentionally different in difficulty, and taken by groups that are intentionally different in ability, on the same scale. The reasons for investigating this process are more than academic; real consequences may be attached to vertical scaling results. It is imperative that when evaluative decisions are made about program efficacy, school funding, or policy directives based upon student test scores, those scores be provided in the most defensible manner possible. This may be especially true with respect to high-stakes decisions regarding certification or licensing. Combining the desirable aspects of vertical scaling with adaptive testing should provide scores that are accurate, defensible, and appropriate.

The focus of this dissertation is to investigate the effect of using an adaptive multistage testing framework with item response theory scaling techniques to establish a vertical scale. The issue of importance is whether vertical scaling using an adaptive multistage testing framework influences the accuracy of measuring ability and ability growth. It is important to understand the effect of using an adaptive multistage testing framework when vertically scaling achievement test data because it seems entirely plausible that using forms which are intentionally different in difficulty may influence the nature of the vertical scale, and questions of appropriate interpretation of the data will likely arise.

Large-scale data collections, like those mentioned earlier, provide educational data for research purposes that will be used in a vast array of disciplines. It is likely that statements about the effectiveness of interventions, programs, or other issues surrounding the educational performance of students will be made on the basis of these data. Whether the adaptive multistage framework might influence any statement or conclusion about the effectiveness of schools, teachers, or even principals is unknown. Currently, no research exists that investigates the use of IRT in a longitudinal data collection design that incorporates an adaptive multistage testing framework in a vertical scaling context. This study seeks to fill this gap in the literature. Multiple studies have addressed the use of IRT when constructing a vertical scale, but none has systematically investigated the possible benefits of an adaptive multistage testing framework. The literature demonstrates that the desire to follow students' progress is likely to persist, and the use of IRT procedures is likely to continue. Additionally, the desire to accommodate a test to the ability of the test taker seems reasonable, so long as scores from the different forms can be placed in relation to each other in a defensible manner. Using an adaptive multistage testing framework may provide a tractable solution to competing demands on testing technology: accuracy of ability estimation, comparability of scale scores across a developmental range, and effectively accommodating testing to the ability of the test taker.

Research Questions

Specific research questions were formulated such that each question would address a specific issue in using IRT to develop a vertical scale in an adaptive multistage testing framework. The research questions for this study are:

1. Does the use of an adaptive multistage testing framework influence the recovery of IRT domain score values?
2. Does the use of different statistical estimators influence the recovery of IRT domain score values?
3. Does the use of an adaptive multistage testing framework influence the recovery of IRT true growth score values?
4. Does the use of different statistical estimators influence the recovery of IRT true growth score values?
5. Do differences in testing frameworks and statistical estimators influence growth across the range of score distributions?

CHAPTER 2 LITERATURE REVIEW

This chapter provides an introduction to several issues related to adaptive testing and vertical scaling. The first section addresses general aspects of adaptive testing, including adaptive testing situations that are designed for longitudinal analysis. The next section includes a discussion of vertical scaling. Item response theory is discussed next. Following this is a discussion of methods that establish a common IRT metric for different test administrations. Lastly, a review of studies investigating vertical scaling using item response theory is presented.

2.1 Adaptive Testing

Adaptive testing, in its purest sense, is not a new idea. The desire to accurately gauge the level of performance of a student, without overburdening either the test taker or the administrator, has been around since tests were first given (Wainer and Kiely, 1987; Wainer et al., 2000). One example of early adaptive testing is Binet's intelligence test. Weiss (1985; 1982) notes that Binet's test possessed the four general rudiments of an adaptive testing framework:

1. The starting point on the test was variable (based on a previous estimate of ability).
2. Items were scored as they were administered.
3. A following item was chosen based on the score on the preceding item.
4. Testing was stopped based on a predefined termination rule.

Unfortunately, individualized testing is not a feasible option for implementing efficient large-scale procedures. However, it seems that developments in testing technology can provide an approximation to the ultimate goal of individualized assessment: targeting questions to the ability of the test taker and reliably measuring that ability to some pre-specified degree of precision. Three general types of adaptation are introduced here: computerized adaptive testing (CAT), multistage testing (MST), and complete or whole test form adaptation. Each has benefits over traditional paper and pencil (P&P) testing, and burdens that P&P testing does not have. Most comparisons of CAT are made against a similarly situated P&P test (same content, same population of test takers, etc.). These traditional tests are usually described as linear tests. To be effective in its purpose of measuring a large group of heterogeneous test takers accurately, a linear P&P test is usually designed to cover a broad range of ability. The test design generally includes a few difficult items for more able students, a few easy items for less able students, and a large majority of items designed to measure the particular trait well at the middle range of ability (Wainer et al., 2000). Although this framework works well on average, it may have consequences that are unintentional but nonetheless potentially influential for test takers at either extreme of the ability range. Less able test takers may become frustrated, demoralized, and possibly hostile toward the testing situation. Highly able students may become bored, distracted, and possibly dismissive of the testing situation.

Neither of these results is desirable, and adaptive testing procedures might offer superior alternatives.

Complete Test Form Adaptation

Out-of-level testing (also called functional testing or off-level testing) is a process by which a student is given a test form that is different from the one given to the majority of other students in his or her grade. The purpose of this practice is to make the relationship between the student's ability and the difficulty of the test less disparate. This can mean giving a more able student a higher level of the test, or a less able student a lower level of the test. In essence, it is believed that the resulting score from an off-level test would be more accurate "... since [students] are no longer guessing or answering carelessly" (Minnema et al., 2000). If scaling procedures have been carried out correctly, the overall impact of allowing a student to test out-of-level should be minimal. Although administering a test out-of-level is quite easy, the substantive meaning and interpretations garnered from an off-level test score are not as easily understood.

Ayrer and McNamara (1973) studied the effects of off-level testing in a Philadelphia school district. They noted that as more students took an off-level test, the mean grade equivalent (GE) for that particular grade was lowered by about .3 over the course of two years. Long, Shaffran, and Kellog (1977) administered the on-grade and off-grade levels of the Gates-MacGinitie Vocabulary and Comprehension subtests to a sample of students who were one or more grade levels behind in reading based on scores from another test of vocabulary.

Students were assigned an on-level test and an off-level test based upon scores on the Botel Word Opposites Test. The order in which the on-level and off-level tests were administered was counterbalanced to avoid introducing any order effect. In this study, students in grade 2 could take a test that was one grade behind; students in grade 3 could take forms designed for grades one or two; and students in grade 4 could take forms designed for grades one, two, or three. Results indicated that for grades 2 and 3, the off-level tests consistently produced higher GE scores. For grade 4, the off-level tests produced systematically lower GE scores. These results are problematic, especially with respect to test score reporting for accountability purposes. If an off-level administration of a test is designed to provide a more accurate estimate of ability or achievement, the resulting scores should not have been as disparate. These results also indicate that teachers, policymakers, and other stakeholders would seek from the measurement profession a manner of scaling that would produce scores which could more easily be compared across grades, especially when issues of accountability and funding are considered.

Multistage Testing

The decision to use multistage testing involves weighing the relative burdens and benefits that are pertinent to a particular testing situation. A multistage testing framework offers some of the major benefits of CAT, especially when compared to linear tests.

Multistage testing can take place as a CAT, or it can take place using paper and pencil as the mode of administration. The main distinction between a multistage test (MST) and a traditional CAT is that adaptation of the testing environment to the ability of the test taker does not occur at the item level. Adaptation occurs at the testlet level, which offers distinct advantages over traditional linear tests, and some advantages and drawbacks when compared to a traditional item level CAT. When compared to an item level CAT, the MST offers the advantages of greater control over test construction (balancing of content domains), a more plausible assumption of item independence (between testlets), increased control of item ordering, allowance for item review within a testlet, and fewer data management demands (Hendrickson, 2007). However, there are some disadvantages of an MST when compared to an item level CAT. One disadvantage is that more items are generally needed to reach the same level of precision. The use of testlets also has a bearing on the test development process. A testlet must be developed as a whole, and this may be a greater burden with respect to item development. Also, if a particular item within a testlet begins to function poorly, there is no known procedure by which it could be replaced while retaining the original functionality of the testlet.

Computerized Adaptive Testing

Computerized adaptive testing necessarily requires the development of computer software and compatible hardware, but there are several other aspects that must be considered before a fully operational system is in working order. Weiss and Kingsbury (1984) note several components:

1. Item response model: Parametric and non-parametric models are available for use. The important issue is to make sure that the model chosen is the most appropriate for the data available.

2. Item pool: The number of items made available to the test taker must be large enough to avoid item overexposure, and varied enough to allow precise measurement throughout the range of ability. The items must also be calibrated so that they are on the same scale.

3. Entry level: The location along the item difficulty continuum at which a CAT session begins can be changed. The first item given to a test taker is usually set around the middle of the difficulty range, with subsequent items adapting to the particular pattern of responses made by the test taker. The difficulty of the item at the starting point generally does not have a large impact on a person's score.

4. Item selection rule: Along with adjusting for difficulty and the pattern of responses of the test taker, issues of process and content balance may enter into item selection. An item may enter into a testing sequence even if it is not the most statistically optimal, because it covers a domain that has not yet been satisfactorily measured. Two general statistical procedures are used to sequentially select items: the first is the maximum information function (when MLE is used), and the second is the minimum posterior variance (when Bayesian estimation is used).

Because both procedures are related to the information function, the items selected by the two procedures are usually quite similar (a small sketch of information-based selection follows this list).

5. Scoring method: Two general methods are used to score responses in a CAT: maximum likelihood and Bayesian. Bayesian estimation can generally accommodate response patterns that would be troublesome under maximum likelihood scoring. However, Bayesian estimation regresses the estimated ability toward prior estimates of ability. Maximum likelihood estimation is asymptotically unbiased, producing the smallest variance among unbiased estimators. Bayesian estimation produces biased estimates of ability, but these usually have a smaller variance than MLE. Weiss and Kingsbury (1984) suggest a compromise of using both estimators to capitalize on the benefits of each.

6. Termination rule: Whether MLE or Bayesian methods are used, the decision to end the testing session must be governed by some criterion. Some tests are fixed length, where a pre-specified number of items is given. Other tests stop after the error of measurement has become sufficiently small to provide a reasonable point estimate and a narrow range for the resulting test score. Some criteria may be specified with respect to a cut score, with testing stopping once a person's score is sufficiently above or below the cut score to declare him or her a master or non-master.
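The sketch below illustrates the maximum-information selection rule named in component 4, using the standard 3PL item information function. The item parameter values and the D = 1.7 constant are illustrative assumptions; an operational CAT would additionally enforce content-balance and exposure constraints.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of 3PL items at ability theta."""
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, a, b, c, administered):
    """Pick the unadministered item with maximum information at the current estimate."""
    info = info_3pl(theta_hat, a, b, c)
    info[list(administered)] = -np.inf   # exclude items already given
    return int(np.argmax(info))

# Hypothetical pool of five items and a partially completed session
a = np.array([1.4, 0.9, 1.8, 1.1, 1.6])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
c = np.full(5, 0.2)
next_item = select_next_item(theta_hat=0.3, a=a, b=b, c=c, administered={0, 2})
print(next_item)
```

A Bayesian variant would instead pick the item that minimizes the expected posterior variance of θ, but because posterior variance is driven by the same information function, the two rules tend to choose similar items, as noted above.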

There are several benefits of CAT, although their full realization had to await developments in technology. With the advent of sophisticated desktop computers, those benefits became a reality. The major benefits of computerized adaptive testing are:

Efficiency: Adaptive testing generally produces a test that is at least as reliable as a linear test with considerably fewer items being administered.

Precision: Adaptive tests generally provide greater precision (i.e., less measurement error) over a larger range of the ability continuum when compared to a linear test.

Appropriateness: Adaptive testing tries to align the difficulty of the items chosen to the best estimate of the ability of a test taker. The issue of having to answer very hard and very easy items is avoided.

Immediacy: Test scores are immediately available to a test taker.

Customizability: Given particular testing constraints, tests can be given on a more flexible schedule; tests can proceed at a test taker's own pace; items that seem to be troublesome can be removed; and novel item formats can be used.

2.2 Adaptive Multistage Testing

Most implementations of adaptive testing are concerned with accurate measurement of a particular trait or ability at one particular point in time. This is true whether the mode of administration is adaptive at the item level, testlet level, or multistage level.

This is to be expected, especially given the purposes for which adaptive testing has been applied (e.g., admissions, certification, etc.). Most studies of multistage testing also apply the process of ability estimation to one particular point in time. However, the methods that are used to estimate ability for one administration may be successfully applied to monitoring change in ability over time.

One early study of multistage testing was carried out by Linn, Rock, and Cleary (1969) using the SCAT and STEP tests. Although the procedures were not based on IRT methods, the information used to develop the two-stage forms approximated the idea of producing a reliable estimate of ability with fewer items. The criterion was an index relating the number of items a conventional test would need in order to match the validity coefficient (between these tests and the PSAT Verbal and Math tests) to the number of items on the two-stage test. The authors found that this value was 3.36 based on separate group regressions, and 2.33 for a regression based on a common slope. This study indicates that even when items were grouped based upon classical item level statistics, tailoring items to the ability of the test taker was beneficial.

Lord (1971) investigated the application of different adaptation procedures in adaptive testing. He noted that this application was a break from earlier two-stage testing used in personnel decision making, where borderline examinees took a second exam that would facilitate a better classification decision of selection or rejection. In the new application, the second level of testing would apply to all test takers, and the added utility is more precise measurement based upon current estimates of ability. The main results from this study demonstrated that different adaptive methods provide greater information at different points along the ability range.

As Lord expected, the adaptive procedure produced less error variance at the high and low ends of the ability range.

Bock and Zimowski (1998) carried out a feasibility study of two-stage testing as it might be implemented in the National Assessment of Educational Progress (NAEP). A test of science in four areas (earth science, biology, chemistry, and physics) was given to a sample of secondary students in Ohio. Tests were assembled from science items received in response to an earlier request for science-based items sent to state, provincial, and national testing programs. The first-stage test was administered in January or February of 1991, and the second-stage test was administered in April or May. Assignment to the second-stage form was based upon the subdomain number correct score, the number of courses taken in the subdomain, and the number correct score for the total test. Random (presumably parallel) forms were constructed within each difficulty level for each domain, and assignment to a particular form within a difficulty level was random. The order of presenting the domains was spiraled across tests. Because this was designed to assess the feasibility of NAEP procedures, not every student took every item. The links between forms were not clearly articulated, but it was noted that items were shared across forms.

This study introduced several complex issues. The first issue that may influence ability estimation is whether learning had occurred between the test administrations. It is likely that this was the case, but the results did not mention any aspect of growth or change in science achievement as the focus of the study.

Also, assignment to forms was done using a set of complex decision rules as opposed to using an estimate of ability only. It may be that using characteristics other than an ability estimate to assign forms provided more appropriate assignments, but the use of characteristics external to the test in an operational testing situation, such as certification or accountability, would likely not be acceptable. Finally, the focus of this study was geared toward possible improvements in the data collection procedures used in NAEP. The authors noted that the inclusion of an adaptive testing component, combined with longitudinal data collection, could provide the ability to develop validity evidence for the NAEP scale.

Two studies (Schnipke and Reese, 1999; Reese and Schnipke, 1999) addressed the use of testlets in computer adaptive testing. Both studies addressed the use of CAT and testlets to estimate ability at one particular point in time. Both indicated that the two-stage design produced less error across the range of ability than did a conventional paper and pencil test. A CAT that used testlets performed only slightly worse than a traditional item-level CAT. These results indicate that the use of testlets can approximate the high degree of precision attained by item-level test adaptation. The tradeoff between the relatively small loss of precision and the greater control over content balancing and ease of test administration would have to be weighed based on the purposes of testing. For large-scale, relatively low-stakes assessments, the combined strengths of paper and pencil administration and adaptive forms would be beneficial. One study examined an explicit application of computerized adaptive testing


More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Advanced Dental Admission Test (ADAT) Official Results: 2017

Advanced Dental Admission Test (ADAT) Official Results: 2017 Advanced Dental Admission Test (ADAT) Official Results: 2017 Normative Period: April 3 through August 31, 2017 Number of ADAT Candidates: 481 Number of ADAT Administrations: 483 Report Date: September

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1 SLEEP DISTURBANCE A brief guide to the PROMIS Sleep Disturbance instruments: ADULT PROMIS Item Bank v1.0 Sleep Disturbance PROMIS Short Form v1.0 Sleep Disturbance 4a PROMIS Short Form v1.0 Sleep Disturbance

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN By Moatasim A. Barri B.S., King Abdul Aziz University M.S.Ed., The University of Kansas Ph.D.,

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

Supplementary Material*

Supplementary Material* Supplementary Material* Lipner RS, Brossman BG, Samonte KM, Durning SJ. Effect of Access to an Electronic Medical Resource on Performance Characteristics of a Certification Examination. A Randomized Controlled

More information

ANXIETY A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Pediatric Bank v1.0 Anxiety PROMIS Pediatric Short Form v1.0 - Anxiety 8a PROMIS Item Bank v1.0 Anxiety PROMIS

More information

New Mexico TEAM Professional Development Module: Deaf-blindness

New Mexico TEAM Professional Development Module: Deaf-blindness [Slide 1] Welcome Welcome to the New Mexico TEAM technical assistance module on making eligibility determinations under the category of deaf-blindness. This module will review the guidance of the NM TEAM

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

FATIGUE. A brief guide to the PROMIS Fatigue instruments:

FATIGUE. A brief guide to the PROMIS Fatigue instruments: FATIGUE A brief guide to the PROMIS Fatigue instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS Ca Bank v1.0 Fatigue PROMIS Pediatric Bank v2.0 Fatigue PROMIS Pediatric Bank v1.0 Fatigue* PROMIS

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan

More information

SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM

SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM 505-3-.98 SPECIAL EDUCATION DEAF EDUCATION ENDORSEMENT PROGRAM To Become Effective June 15, 2016 Nature of Amendment(s): Substantive Clarification Further Discussion: It is proposed that GaPSC Rule 505-3-.98

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Chapter 4 Research Methodology

Chapter 4 Research Methodology Chapter 4 Research Methodology 137 RESEARCH METHODOLOGY Research Gap Having done a thorough literature review on gender diversity practices in IT organisations, it has been observed that there exists a

More information

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners

Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Test-Taking Strategies and Task-based Assessment: The Case of Iranian EFL Learners Hossein Barati Department of English, Faculty of Foreign Languages, University of Isfahan barati@yahoo.com Zohreh Kashkoul*

More information

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES Smoking Negative Psychosocial Expectancies A brief guide to the PROMIS Smoking Negative Psychosocial Expectancies instruments: ADULT PROMIS Item Bank v1.0 Smoking Negative Psychosocial Expectancies for

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference*

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference* PROMIS Item Bank v1.1 Pain Interference PROMIS Item Bank v1.0 Pain Interference* PROMIS Short Form v1.0 Pain Interference 4a PROMIS Short Form v1.0 Pain Interference 6a PROMIS Short Form v1.0 Pain Interference

More information

Smoking Social Motivations

Smoking Social Motivations Smoking Social Motivations A brief guide to the PROMIS Smoking Social Motivations instruments: ADULT PROMIS Item Bank v1.0 Smoking Social Motivations for All Smokers PROMIS Item Bank v1.0 Smoking Social

More information

A Comparison of Four Test Equating Methods

A Comparison of Four Test Equating Methods A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS

AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS Attached are the performance reports and analyses for participants from your surgery program on the

More information

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance

The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance The Effects of Societal Versus Professor Stereotype Threats on Female Math Performance Lauren Byrne, Melannie Tate Faculty Sponsor: Bianca Basten, Department of Psychology ABSTRACT Psychological research

More information

Chapter-2 RESEARCH DESIGN

Chapter-2 RESEARCH DESIGN Chapter-2 RESEARCH DESIGN 33 2.1 Introduction to Research Methodology: The general meaning of research is the search for knowledge. Research is also defined as a careful investigation or inquiry, especially

More information

2016 Technical Report National Board Dental Hygiene Examination

2016 Technical Report National Board Dental Hygiene Examination 2016 Technical Report National Board Dental Hygiene Examination 2017 Joint Commission on National Dental Examinations All rights reserved. 211 East Chicago Avenue Chicago, Illinois 60611-2637 800.232.1694

More information

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM

More information

INTRODUCTION TO ASSESSMENT OPTIONS

INTRODUCTION TO ASSESSMENT OPTIONS DEPRESSION A brief guide to the PROMIS Depression instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.0 Depression PROMIS Pediatric Item Bank v2.0 Depressive Symptoms PROMIS Pediatric

More information

CHAPTER V. Summary and Recommendations. policies, including uniforms (Behling, 1994). The purpose of this study was to

CHAPTER V. Summary and Recommendations. policies, including uniforms (Behling, 1994). The purpose of this study was to HAPTER V Summary and Recommendations The current belief that fashionable clothing worn to school by students influences their attitude and behavior is the major impetus behind the adoption of stricter

More information

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT by Debra White Moore B.M.Ed., University of North Carolina, Greensboro, 1989 M.A., University of Pittsburgh,

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

Models in Educational Measurement

Models in Educational Measurement Models in Educational Measurement Jan-Eric Gustafsson Department of Education and Special Education University of Gothenburg Background Measurement in education and psychology has increasingly come to

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Using response time data to inform the coding of omitted responses

Using response time data to inform the coding of omitted responses Psychological Test and Assessment Modeling, Volume 58, 2016 (4), 671-701 Using response time data to inform the coding of omitted responses Jonathan P. Weeks 1, Matthias von Davier & Kentaro Yamamoto Abstract

More information

The power of positive thinking: the effects of selfesteem, explanatory style, and trait hope on emotional wellbeing

The power of positive thinking: the effects of selfesteem, explanatory style, and trait hope on emotional wellbeing University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2009 The power of positive thinking: the effects of selfesteem,

More information

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

Estimating the number of components with defects post-release that showed no defects in testing

Estimating the number of components with defects post-release that showed no defects in testing SOFTWARE TESTING, VERIFICATION AND RELIABILITY Softw. Test. Verif. Reliab. 2002; 12:93 122 (DOI: 10.1002/stvr.235) Estimating the number of components with defects post-release that showed no defects in

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Public Health Masters (MPH) Competencies and Coursework by Major

Public Health Masters (MPH) Competencies and Coursework by Major I. Master of Science of Public Health A. Core Competencies B. Major Specific Competencies i. Professional Health Education ii. iii. iv. Family Activity Physical Activity Behavioral, Social, and Community

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS

INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION CONTENTS INTERNATIONAL STANDARD ON ASSURANCE ENGAGEMENTS 3000 ASSURANCE ENGAGEMENTS OTHER THAN AUDITS OR REVIEWS OF HISTORICAL FINANCIAL INFORMATION (Effective for assurance reports dated on or after January 1,

More information

UvA-DARE (Digital Academic Repository)

UvA-DARE (Digital Academic Repository) UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information