Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales


University of Iowa
Iowa Research Online
Theses and Dissertations

Summer 2013

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Anna Marie Topczewski
University of Iowa

Copyright 2013 Anna Marie Topczewski

This dissertation is available at Iowa Research Online:

Recommended Citation
Topczewski, Anna Marie. "Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales." PhD (Doctor of Philosophy) thesis, University of Iowa, 2013.

Follow this and additional works at:
Part of the Educational Psychology Commons

EFFECT OF VIOLATING UNIDIMENSIONAL ITEM RESPONSE THEORY VERTICAL SCALING ASSUMPTIONS ON DEVELOPMENTAL SCORE SCALES

by
Anna Marie Topczewski

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

August 2013

Thesis Supervisor: Professor Michael J. Kolen

Copyright by
ANNA MARIE TOPCZEWSKI
2013
All Rights Reserved

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Anna Marie Topczewski has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the August 2013 graduation.

Thesis Committee:
Michael J. Kolen, Thesis Supervisor
Timothy N. Ansley
Mary Kathryn Cowles
Deborah Harris
Won-Chan Lee

To my little butterfly, Chantel
Without whom I never would have had the strength to get this far

ACKNOWLEDGEMENTS

This dissertation is what it is today because of the many individuals who supported me during this journey. I first need to thank my husband, Joe. You laughed with me in the good times, and you struggled with me in the bad times. You lived with me, and you brought out the best in me always. You comforted me when I was downhearted, and you shared in all of my accomplishments. And for all of this I am more thankful than I can say. You, Chantel, and Hailey were with me the entire time, and words cannot describe how thankful I am to the three of you. I wish to thank my thesis advisor, Dr. Michael Kolen. Without your guidance, suggestions, and encouragement this work would not have been completed. I would like to thank my other committee members: Dr. Timothy Ansley, Dr. Mary Kathryn Cowles, Dr. Deborah Harris, and Dr. Won-Chan Lee. Your thoughts and perspectives pushed me to explore ideas that made this work better. And a special thanks to Dave Woodruff; your edits and comments were invaluable. I would like to thank Iowa Testing Programs for providing me with the resources to complete a graduate career. I want to thank all my fellow students of the 200 office, in particular Karoline, who passed the dissertation desk on to me. The sunshine was what I needed to clear my head during this tiring process. To all my friends, who gave me support, encouragement, and reminded me to take a break once in a while. And lastly, I would like to thank my parents, John and Colleen, my brother John, my parents-in-law, John and Janice, my siblings-in-law, Jeremy and Janell, and mommy #2 Judy.

ABSTRACT

Developmental score scales represent the performance of students along a continuum, where, as students learn more, they move higher along that continuum. Unidimensional item response theory (UIRT) vertical scaling has become a commonly used method to create developmental score scales. Research has shown that UIRT vertical scaling methods can be inconsistent in estimating the grade-to-grade growth, within-grade variability, and separation of grade distributions (effect size) of developmental score scales. In particular, the finding of scale shrinkage (decreasing within-grade score variability as grade level increases) has led to concerns about and criticism of IRT vertical scales. The causes of scale shrinkage have yet to be fully understood. Real test data and simulation studies have been unable to provide complete answers as to why IRT vertical scaling inconsistencies occur. Violations of assumptions have been a commonly cited potential cause for the inconsistent results. For this reason, this dissertation is an extensive investigation into how violations of the three assumptions of UIRT vertical scaling (local item independence, unidimensionality, and similar reliability of grade level tests) affect estimated developmental score scales. Simulated tests were developed that purposefully violated a UIRT vertical scaling assumption. Three sets of simulated tests were created to test the effect of violating a single assumption. First, simulated tests were created with increasing, decreasing, low, medium, and high local item dependence. Second, multidimensional simulated tests were created by varying the correlation between dimensions. Third, simulated tests with dissimilar reliability were created by varying item parameter characteristics of the grade level tests. Multiple versions of twelve simulated tests were used to investigate UIRT vertical scaling assumption violations. The simulated tests were calibrated under the UIRT model to purposefully violate an assumption of UIRT vertical scaling. Each simulated test version was replicated for 1000 random examinee samples to assess the

bias and standard error of the estimated grade-to-grade-growth, within-grade-variability, and separation-of-grade-distributions (effect size) of the estimated developmental score scales. The results suggest that when UIRT vertical scaling assumptions are violated, the resulting estimated developmental score scales contain standard error and bias. For this study, the magnitude of standard error was similar across all simulated tests regardless of the assumption violation. However, bias fluctuated as a result of different types and magnitudes of UIRT vertical scaling assumption violations. More local item dependence resulted in more grade-to-grade-growth and separation-of-grade-distributions bias. And local item dependence resulted in developmental score scales that displayed scale expansion. Multidimensionality resulted in more grade-to-grade-growth and separation-of-grade-distributions bias when the correlation between dimensions was smaller. Multidimensionality resulted in developmental score scales that displayed scale expansion. Dissimilar reliability of grade level tests resulted in more grade-to-grade-growth bias and minimal separation-of-grade-distributions bias. Dissimilar reliability of grade level tests resulted in scale expansion or scale shrinkage depending on the item characteristics of the test. Limitations of this study and future research are discussed.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER I. INTRODUCTION
    Item Response Theory Vertical Scaling
    Unidimensional IRT Vertical Scaling Assumptions
    Research Questions
    Educational Significance
CHAPTER II. LITERATURE REVIEW
    Linking: Equating, Scaling, and Predicting
        Equating
        Scale Aligning
        Predicting
    Vertical Scaling Data Collection Designs
        Common Item Design
        Scaling Test Design
        Equivalent Groups Design
        Complications with the Designs
        Research on the Designs
    Vertical Scaling Methods
        Hieronymus Scaling Method
        Thurstone Scaling Method
        Item Response Theory Scaling Method
    Research Comparing the Scaling Methods
    Scale Shrinkage
        Scale Shrinkage Research
    Summary
CHAPTER III. METHODOLOGY
    Structure and Characteristics of the Real Tests
        Design of the Real Tests
        Characteristics of Item and Ability Parameters
    Simulated Tests
        Structure of the Simulated Tests
        Creating the Simulated Tests
        Parameter Conditions of the Simulated Tests
        Calibration of the Simulated Tests
        Analysis of the Simulated Tests
    Summary
CHAPTER IV. RESULTS
    Preliminary Analysis of Simulated Tests
        Convergence Rates of the Simulated Test Replications
    Analysis of Simulated Tests
        First Research Question-Local Item Dependence
        Second Research Question-Multidimensionality
        Third Research Question-Dissimilar Reliabilities
    Overall Summary
CHAPTER V. DISCUSSION AND CONCLUSIONS
    Discussion of Major Findings
        Simulated Test Replications
        Average Standard Error
        First Research Question-Local Item Dependence
        Second Research Question-Multidimensionality
        Third Research Question-Dissimilar Reliabilities
        Overall Conclusions
    Limitations and Future Research
        Within-Grade Variability
        Similar Reliability of Grade Level Tests
        Proficiency (Ability) Estimators
        Choice of Base Grade
        Modeling Multidimensionality
        Interaction of Multiple Assumption Violations
        Scale Transformations
    Overall Summary
REFERENCES
APPENDIX A. BILOG CODE
APPENDIX B. EVALUATION CRITERIA RMSE

LIST OF TABLES

Table 1. First and third research questions: ability parameter condition
2. First research question: item parameter conditions
3. Second research question: ability parameter conditions
4. Second research question: item parameter conditions
5. Third research question: item parameter conditions
6. Research questions proposed simulated tests
7. Number of non-converged administered simulated test replications
8. First research question: item discrimination parameter means and standard deviations
9. First research question: item difficulty parameter means and standard deviations
10. First research question: item pseudo-guessing parameter means and standard deviations
11. First research question: item testlet parameter means and standard deviations
12. First research question: reliability means and standard deviations
13. First research question: grade-to-grade-growth average absolute bias means and standard deviations
14. First research question: within-grade-variability average bias means and standard deviations
15. First research question: separation-of-grade-distributions average absolute bias means and standard deviations
16. First research question: grade-to-grade-growth average standard error means and standard deviations
17. First research question: within-grade-variability average standard error means and standard deviations
18. First research question: separation-of-grade-distributions average standard error means and standard deviations
19. Second research question: item parameters means and standard deviations
20. Second research question: reliability means and standard deviations
21. Second research question: grade-to-grade-growth average absolute bias means and standard deviations
22. Second research question: within-grade-variability average bias means and standard deviations
23. Second research question: separation-of-grade-distributions average absolute bias means and standard deviations
24. Second research question: grade-to-grade-growth average standard error means and standard deviations
25. Second research question: within-grade-variability average standard error means and standard deviations
26. Second research question: separation-of-grade-distributions average standard error means and standard deviations
27. Third research question: item discrimination parameter means and standard deviations
28. Third research question: item difficulty parameter means and standard deviations
29. Third research question: item pseudo-guessing parameter means and standard deviations
30. Third research question: reliability means and standard deviations
31. Third research question: grade-to-grade-growth average absolute bias means and standard deviations
32. Third research question: within-grade-variability average bias means and standard deviations
33. Third research question: separation-of-grade-distributions average absolute bias means and standard deviations
34. Third research question: grade-to-grade-growth average standard error means and standard deviations
35. Third research question: within-grade-variability average standard error means and standard deviations
36. Third research question: separation-of-grade-distributions average standard error means and standard deviations
B1. First research question: grade-to-grade-growth RMSE means and standard deviations
B2. Second research question: grade-to-grade-growth RMSE means and standard deviations
B3. Second research question: grade-to-grade-growth RMSE means and standard deviations
B4. Second research question: grade-to-grade-growth RMSE means and standard deviations
B5. Second research question: within-grade-variability RMSE means and standard deviations
B6. Third research question: within-grade-variability RMSE means and standard deviations
B7. First research question: separation-of-grade-distributions RMSE means and standard deviations
B8. Second research question: separation-of-grade-distributions RMSE means and standard deviations
B9. Third research question: separation-of-grade-distributions RMSE means and standard deviations

LIST OF FIGURES

Figure 1. Visualization of common item design
2. Visualization of scaling test design
3. Visualization of equivalent groups design
4. Illustration of the response probability curve for a 3PL UIRT item
5. Observed and proposed grade-to-grade growth
6. Second research question's simulated test structure
7. First and third research questions' simulated test structure
8. An illustration of the process used to create a simulated test replication
9. First research question: item discrimination parameter means
10. First research question: item difficulty parameter means
11. First research question: item pseudo-guessing parameter means
12. First research question: item testlet parameter means
13. First research question: reliability means
14. True and observed grade-to-grade growth for replications 1 and 2, increase LID simulated test
15. True and observed within-grade variability for replications 1 and 2, increase LID simulated test
16. True and observed separation of grade distributions for replications 1 and 2, increase LID simulated test
17. First research question: grade-to-grade-growth average absolute bias means
18. First research question: within-grade-variability average bias means
19. First research question: separation-of-grade-distributions average absolute bias means
20. First research question: grade-to-grade-growth average standard error means
21. First research question: grade-to-grade-growth average standard error means
22. First research question: separation-of-grade-distributions average standard error means
23. Second research question: item parameters means
24. Second research question: reliability means
25. Second research question: grade-to-grade-growth average absolute bias means
26. Second research question: within-grade-variability average bias means
27. Second research question: separation-of-grade-distributions absolute average bias means
28. Second research question: grade-to-grade-growth average standard error means
29. Second research question: within-grade-variability average standard error means
30. Second research question: separation-of-grade-distributions average standard error means
31. Third research question: item discrimination parameter means
32. Third research question: item difficulty parameter means
33. Third research question: item pseudo-guessing parameter means
34. Third research question: reliability means
35. Third research question: grade-to-grade-growth average absolute bias means
36. Third research question: within-grade-variability average bias means
37. Third research question: separation-of-grade-distributions average absolute bias means
38. Third research question: grade-to-grade-growth average standard error means
39. Third research question: within-grade-variability average standard error means
40. Third research question: separation-of-grade-distributions average standard error means
A1. An example of the BILOG code

CHAPTER I
INTRODUCTION

In educational measurement, a developmental score represents the performance of students along a continuum. As examinees advance from grade to grade, their performance is assessed along the developmental score scale. Vertical scaling is a methodology for creating a developmental score scale. The uses of developmental score scales have changed, in part, from the passage of the No Child Left Behind Act of 2001 (NCLB; Public Law 107-110). Before the implementation of the NCLB, vertical scaling methods were used primarily for formative assessment, such as the establishment of grade equivalent score scales for assessing examinees' strengths and weaknesses (Yen, 2007). Pre-NCLB developmental score scales were used generally to make low-stakes decisions (Yen, 2007). NCLB requires States to measure their students' performance and demonstrate the achievement of adequate yearly progress (AYP) relative to present proficiency standards, thereby demonstrating appropriate gains in educational achievement. For an illustration of this requirement, Title III of the NCLB Act states, "A state shall approve evaluative measures that are designed to assess the progress of children in attaining English proficiency." As a result of this educational reform, the post-NCLB uses of developmental score scales have shifted from low-stakes formative assessment to high-stakes summative assessment, such as teacher evaluations and school and district performance (Harris, 2007; Yen, 2007).

Item Response Theory Vertical Scaling

Item response theory (IRT) vertical scaling (Reckase, 2010) is often used to establish developmental score scales. IRT vertical scaling methods were first used by CTB/McGraw-Hill in 1981 to vertically scale the Comprehensive Tests of Basic Skills, Form U (CTB/McGraw-Hill, 1981), and the California Achievement Tests, Form E (CTB/McGraw-Hill, 1985). However, IRT vertical scaling methods have sometimes

produced questionable developmental score scales, in particular when estimated grade-to-grade growth and within-grade variability were determined (Harris, 2007; Kolen & Brennan, 2004). It is important to investigate the causes of this instability in IRT vertical scaling methods, especially now that high-stakes decisions are being made based on the results from developmental score scales. The purpose of this research is to investigate the possible causes of these questionable IRT vertical scaling results.

Unidimensional IRT Vertical Scaling Assumptions

IRT vertical scaling requires many methodological choices. Eight such choices include:

1) model (unidimensional, multidimensional, etc.)
2) number of item parameters
3) calibration method
4) linear scale transformation if concurrent calibration is not used
5) base grade
6) scoring method
7) estimation method implemented within a calibration program
8) proficiency (ability) estimator

Different methods can produce different results, making IRT vertical scaling a very complex process. Some of the problems with IRT vertical scaling may be associated with choices of methods as well as with the fit of the chosen psychometric model to the test data. The unidimensional IRT (UIRT) model has often been used as a psychometric model for vertical scaling. UIRT vertical scaling makes three strong assumptions: local item independence, unidimensionality, and similar reliabilities of grade level tests. Local item independence and unidimensionality are exclusive assumptions of UIRT (Lord, 1980), whereas similar reliability of grade level tests is an assumption of vertical scaling (Holland, 2007). It should be noted that the similar reliability of grade level test

assumption of vertical scaling is not universally held. However, for the purposes of this study the similar reliability of grade level test assumption of vertical scaling is considered. Violations of these three assumptions have been cited as reasons for the problematic IRT vertical scaling results (Camilli, 1988; Camilli, Yamamoto, & Yang, 1993; Sireci, Wainer, & Thissen, 1991; Wainer & Thissen, 1996; Yen, 1985; 1986; 1993).

Local Item Independence Assumption

Local item independence occurs when the probability of success, for an examinee with a given fixed ability level, on all n items equals the product of the probabilities of success on each item (Lord, 1980). This assumption is violated if an examinee's response to one item can influence his or her response to another item. An example is when a group of items (a testlet) all require the use of the same graph, table, or passage. The examinee's understanding of the stimulus may influence his or her responses to the set of items, making those items dependent. Testlet IRT (TIRT) models have been proposed as an extension of the UIRT model where local item dependence is modeled by a testlet parameter (Wainer, Bradlow, & Wang, 2007).
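Stated formally, local item independence requires that, for any fixed ability, the joint probability of a full response pattern factors into a product over items. A compact statement of this standard condition (Lord, 1980) is

$$P\!\left(U_1 = u_1, \ldots, U_n = u_n \mid \theta\right) \;=\; \prod_{j=1}^{n} P_j(\theta)^{u_j}\,\bigl[1 - P_j(\theta)\bigr]^{1 - u_j},$$

where $u_j$ is the scored response to item j and $P_j(\theta)$ is the probability of a correct response to item j. Testlet items of the kind described above violate this factorization because, even after conditioning on ability, knowledge of the stimulus ties the item responses together.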

Unidimensionality Assumption

Unidimensionality implies that examinees and items can be described by a single ability or trait (Yen & Fitzpatrick, 2006). This assumption is met when all of the items comprising a test measure the same underlying ability and examinees use only this ability to respond to the test. This assumption is a strong one given that some items may require the use of multiple abilities for a correct response. Accordingly, multidimensional IRT (MIRT) models that allow the consideration of multiple dimensions have been developed (Reckase, 2009).

Reliability Assumption

Reliability is defined as the ratio of true score variance to observed score variance. In IRT, true score variance can be defined as ability (theta) variance and observed score variance can be defined as estimated ability (estimated theta) variance (Raju, Price, Oshima, & Nering, 2007),

$$\rho = \frac{\sigma^2_{\theta}}{\sigma^2_{\hat{\theta}}}. \qquad (1)$$

The marginal reliability can be found (Green, Bock, Humphreys, Linn, & Reckase, 1984; Raju et al., 2007) as

$$\bar{\rho} = \frac{\sigma^2_{\hat{\theta}} - \bar{\sigma}^2_{e}}{\sigma^2_{\hat{\theta}}}, \qquad (2)$$

where $\sigma^2_{\theta}$ is true score variance, $\sigma^2_{\hat{\theta}}$ is observed score variance, and $\bar{\sigma}^2_{e}$ is marginal error variance. IRT marginal error variance is the average inverse of test information across the entire ability population. Given a particular ability level, test information is the sum of item information across all items. The test information for the three-parameter logistic (3PL) UIRT model is

$$I(\theta_i) = \sum_{j=1}^{n} I_j(\theta_i), \qquad (3)$$

where the item information for item j is

$$I_j(\theta_i) = \frac{D^2 a_j^2\, Q_j(\theta_i)}{P_j(\theta_i)} \left[\frac{P_j(\theta_i) - c_j}{1 - c_j}\right]^2. \qquad (4)$$

In these equations, D is a constant typically set equal to 1.7, $u_j$ is an examinee's response to item j (a value of 1 represents a correct response and 0 represents an incorrect response), $P_j$ is the probability of a correct response to item j, $Q_j$ is the probability of an incorrect response to item j, $a_j$ is the discrimination parameter of item j, $c_j$ is the pseudo-guessing parameter of item j, and $\theta_i$ is the single ability parameter for examinee i. Therefore, IRT reliability is dependent upon the item parameters. The similar reliability of grade level tests assumption is met when the item parameters lead to marginal error variances that produce similar ratios of marginal error variance to observed score variance.
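To make these quantities concrete, the short Python sketch below computes 3PL item information, test information, and a marginal reliability value by averaging the inverse of test information over a standard normal ability population. The item parameter values are hypothetical and the quadrature is a simple grid approximation; this is an illustration of Equations 1 through 4, not the calibration procedure used in the study.

```python
import numpy as np

D = 1.7  # scaling constant used throughout the 3PL model

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL UIRT model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information at ability theta (Equation 4)."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def marginal_reliability(a, b, c, grid=np.linspace(-4, 4, 161)):
    """Marginal reliability (Equation 2), with a standard normal ability
    population approximated on a discrete grid."""
    w = np.exp(-0.5 * grid ** 2)
    w /= w.sum()                                     # normal quadrature weights
    test_info = sum(item_information(grid, *prm) for prm in zip(a, b, c))  # Equation 3
    mean_error_var = np.sum(w / test_info)           # average of 1 / I(theta)
    obs_var = 1.0 + mean_error_var                   # theta variance assumed to be 1
    return (obs_var - mean_error_var) / obs_var

# Hypothetical item parameters for a short grade-level test
a = [1.0, 0.8, 1.2, 0.9, 1.1, 0.7]
b = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
c = [0.2] * 6
print(f"marginal reliability: {marginal_reliability(a, b, c):.3f}")
```

Two grade level tests could then be compared by running this calculation on each test's item parameters and checking whether the resulting reliabilities are similar.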

Previous Research

A considerable body of research has demonstrated that IRT vertical scaling methods can yield inconsistent results, especially in estimating average grade-to-grade growth and within-grade variability (Andrews, 1995; Becker & Forsyth, 1992; Bock, 1983; Camilli et al., 1993; Hoover, 1984; Omar, 1996; 1997; 1998; Seltzer, Frank, & Bryk, 1993; Tong, 2005; Tong & Kolen, 2007; Topczewski, 2012; Williams, Pommerich, & Thissen, 1998; Yen, 1983; 1986; Yen & Burket, 1997). Commonly cited reasons for these inconsistent results are local item dependence (Sireci et al., 1991; Wainer & Thissen, 1996; Yen, 1993), multidimensionality (Yen, 1985; 1986), and measurement error (marginal error variance) (Camilli, 1988; Camilli et al., 1993; Yen, 1993). A limitation of previous research has been the lack of simulation studies with multiple replications. Multiple replications of a simulated test allow random and systematic error to be separated, quantified, and analyzed. In turn, the stability and accuracy of vertical scaling methods can be assessed. However, the results from simulation studies are limited if the generated data are unrealistic and unrepresentative of real tests. Therefore, simulated tests should model real tests as closely as possible. The aim of this dissertation is to use realistic simulated tests to investigate how violations of the three UIRT vertical scaling assumptions (local item independence, unidimensionality, and similar reliability of grade level tests) affect estimated developmental score scales.

Research Questions

1) When UIRT vertical scaling is used, what impact does local item dependence have on the developmental score scale? Specifically, when simulated tests vary in the amount of local item dependence, how are the UIRT vertically scaled developmental score scales impacted?

2) When UIRT vertical scaling is used, what impact does multidimensionality have on the developmental score scale? Specifically, when simulated tests are multidimensional, how are the UIRT vertically scaled developmental score scales impacted?

3) When UIRT vertical scaling is used, what impact does dissimilar reliability of grade level tests have on the developmental score scale? Specifically, when

simulated tests have grade level tests with dissimilar reliability, how are the UIRT vertically scaled developmental score scales impacted?

Educational Significance

Applying the same UIRT vertical scaling methods to simulated tests with multiple replications offers the possibility of separating out random and systematic error. Quantifying random and systematic error can be helpful in understanding error due to sampling or bias, respectively. These two types of error are useful in understanding the stability and accuracy of the UIRT vertical scaling methods: systematic error directly affects accuracy, and standard error directly affects stability. Systematic error is a serious problem because the vertical scaling methods used might not capture the true developmental score scale. IRT is assumed to have sample-free item parameters and item-free ability parameters; when the vertical scaling methods have systematic error, these invariance properties can be affected. However, simulated tests do not necessarily have the complexity of the real tests that a researcher hopes to understand. Modeling simulated tests as closely as possible to real tests may provide an understanding of what might be happening in real tests. Measures of examinee growth are becoming more common in state accountability programs (Reckase, 2010). One mechanism used to understand examinee growth is a developmental score scale. Vertically scaled assessments can be used to measure examinee growth and to better understand examinee learning. Yet for UIRT vertical scaling methods to yield useful estimated developmental score scales, characteristics such as grade-to-grade growth and within-grade variability should appear reasonable. And not knowing the implications of UIRT vertical scaling assumption violations is troublesome. The local item independence, unidimensionality, and similar reliability of grade level tests assumptions are the foundation of UIRT vertical scaling methods. When these assumptions are violated, UIRT vertical scaling might lead to misleading score interpretations.
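As a sketch of how random and systematic error can be separated across replications, the following illustration (with hypothetical numbers, not results from this study) computes the bias and standard error of a grade-to-grade growth estimate from a set of replicated estimates: bias is the difference between the mean estimate and the true value, and the standard error is the standard deviation of the estimates.

```python
import numpy as np

def bias_and_se(estimates, true_value):
    """Separate systematic error (bias) from random error (standard error)
    for one scale characteristic estimated over many replications."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value   # systematic error
    se = estimates.std(ddof=1)             # random (sampling) error
    return bias, se

# Hypothetical grade 3-to-4 growth estimates from five replications,
# with a true growth of 0.50 on the theta scale
growth_estimates = [0.46, 0.55, 0.51, 0.43, 0.58]
bias, se = bias_and_se(growth_estimates, true_value=0.50)
print(f"bias = {bias:+.3f}, standard error = {se:.3f}")
```

In the study itself, each simulated test version is replicated for 1000 random examinee samples, and summaries of this kind are computed for grade-to-grade growth, within-grade variability, and separation of grade distributions.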

The goal of this dissertation is to investigate the effect of violations of UIRT vertical scaling assumptions on score scale characteristics of estimated developmental score scales. Characteristics such as grade-to-grade growth, within-grade variability, and separation of grade distributions are investigated. If it can be shown how these characteristics are affected by assumption violations, testing companies might use this information to increase the understanding of the error contained in their estimated developmental score scales. This information could be used as part of their processes of shaping an assessment's development, refining or redefining the definition of the measured construct, and/or examining the purpose of the assessment.

CHAPTER II
LITERATURE REVIEW

Vertical scaling is a complex practice that requires many design and methodological decisions that affect the estimated developmental score scale. In order to understand the results of a vertical scaling, all of these decisions must be carefully examined. Before designs and methods are used, their assumptions must be considered and verified. All too often the model, design, and method choices are made without much prior thought. This may well lead to results that are not educationally relevant or are impractical. The effect of violations of assumptions, lack of model fit, and unrepresentative samples should be studied so that their impact on the results is known. Within IRT vertical scaling, the models, designs, and methods have not been studied to the point where the results are fully understood. Kolen and Brennan (2004) reviewed vertical scaling research and concluded, "research suggests that vertical scaling is a very complex process that is affected by many factors. These factors likely interact with one another to produce characteristics of a particular scale. The research record provides little guidance as to what methods and procedures work best" (p. 418). Progress has been made; many studies have been completed; but unusual results are still obtained, especially within-grade variability results (Kolen & Brennan, 2004; Harris, 2007). Therefore, this literature review focuses on the models, designs, and methods of vertical scaling, in particular IRT vertical scaling. This chapter reviews the research that has already been completed and highlights what is already known. This chapter consists of six sections. The first section reviews the definitions of equating, scale aligning, and predicting and how these fit under the definition of linking (Holland & Dorans, 2006). This section sets the premise of how vertical scaling fits within this larger context. The second section describes three vertical scaling data collection designs: common item, scaling test, and equivalent groups (Kolen & Brennan,

2004) and discusses research comparing these designs. The third section describes three vertical scaling methods: Hieronymus, Thurstone, and Item Response Theory (IRT) (Kolen & Brennan, 2004). Within the third section, IRT vertical scaling models, calibration methods, linear scale transformations, choice of base grade, scoring methods, estimation methods, calibration programs, and proficiency (ability) estimators are described and research findings are discussed. The fourth section highlights research comparing the three vertical scaling methods. The fifth section defines scale shrinkage and describes scale shrinkage research from 1980 to the present day. The sixth and final section briefly summarizes the literature review.

Linking: Equating, Scale Aligning, and Predicting

Vertical scaling fits under the broad umbrella of linking, which includes three methods: equating, scale aligning, and predicting. Holland (2007) defined linking as "refer[ring] to a general class of transformations between the scores from one test and those of another" (p. 5). Each of these methods has different goals, but all of the methods link one test to another test. The restrictions on what defines each of the tests and the characteristics of the linking are what differentiate equating, scale aligning, and predicting.

Equating

Equating, according to Kolen and Brennan (2004), is "a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably" (p. 2). Equating is accomplished only if strong requirements are satisfied. Dorans and Holland (2000) summarized five requirements for equating. The first two requirements are equal construct and equal reliability. These two requirements restrict test forms to be built to the same content and statistical specifications. The third requirement of symmetry requires that the equating function between two tests is the same as the inverse of that function (Dorans & Holland, 2000). This requirement excludes methods such as linear regression (Galton, 1888), because the regression of one test on another is generally not the inverse of the reverse regression. The fourth requirement of equity, Lord's (1980) equity property of equating, requires that it should not matter which

test a student takes (Dorans & Holland, 2000). If equality is used as a criterion of same test specifications, then Lord himself concluded that the equity requirement "cannot hold for fallible tests unless x and y are parallel tests, in which case there is no need for any equating at all" (p. 196). The fifth and final requirement is population invariance (Dorans & Holland, 2000). Population invariance is satisfied when the equating function holds, no matter which population or subpopulation defines the equating function. When all five of these requirements are met, the linking function can be classified as an equating function. Not all linking functions can satisfy these five strict requirements; some of the requirements may not be met between the two tests. When some but not all of the equating requirements are met, the linking function is commonly classified as scale aligning. When few or none of the equating requirements are met, the linking function is commonly defined as predicting. Thus, if linking is thought of as a continuum, equating functions are on one side where all five requirements are met; scale aligning is somewhere in the middle; and predicting is on the other side where few or none of the requirements are met.

Scale Aligning

Scale aligning places the scores of two different tests on a common scale so that the scores may be comparable. Holland and Dorans (2006) divide scale aligning into two cases, "(a) different constructs or (b) similar constructs but with different test specifications" (p. 190). According to their definition, the first case clearly violates the first equating requirement (equal construct), and the second case partially violates the first and second requirements (equal construct and reliability). Battery scaling is an example of case (a), where each test in a battery measures a different construct, but the scores of each test are transformed to a common scale (Kolen & Brennan, 2004). Calibration, vertical scaling, and concordance fall within case (b) of scale aligning. Calibration is referred to by Holland (2007) as "situations in which the tests measure similar constructs, have similar levels of difficulty, but have dissimilar

reliabilities" (p. 18). However, the term calibration is often used (as it will be used here) as a method of estimating IRT parameters (Lord, 1980; Petersen et al., 1989; Kolen & Brennan, 2004). Concordance occurs when two tests measure similar constructs, have similar reliability and item difficulty, and are intended for similar populations (Holland, 2007). An example of a concordance is the relationship between the ACT and SAT exams. Both tests are commonly used as college entrance exams, and a concordance between the two exams allows a comparison of scores. It should be noted that although comparisons are allowed, the interpretation of interchangeable scores is not afforded because the equating requirements were not met. Vertical scaling is a link between tests that have similar constructs and similar reliability, but dissimilar item difficulty and populations (Holland, 2007). Each test is meant for a different grade level (different populations), and typically a lower grade level test consists of easier items whereas a higher grade level test consists of more difficult items (dissimilar item difficulty). Vertical scaling places grade level tests of the same construct onto a common overall scale "so that progress in a given subject, such as mathematics or reading, can be tracked over time" (Holland & Dorans, 2006, p. 192). The result of vertical scaling is a developmental score scale (Kolen & Brennan, 2004). It should be noted that, according to Holland's (2007) definition of vertical scaling, similar reliability of grade level tests is required.

Predicting

According to Holland (2007), the goal of predicting is "to predict an examinee's score on one test from some other information about the examinee" (p. 7). Demographic and/or background information, a composite score from a test, or another test can be used as a predictor according to Holland's (2007) definition. The absence of equating requirements makes predicting the weakest form of linking. An example of predicting is when high school GPA is used to predict performance on the SAT.

Vertical Scaling Data Collection Designs

Vertical scaling methods are used to link grade level tests to the same developmental score scale. The data collection designs influence how vertical scaling methods are completed. There are three vertical scaling data collection designs: common item, scaling test, and equivalent groups (Kolen & Brennan, 2004; Kolen, 2006). The common item and scaling test are the most commonly used designs.

Common Item Design

The common item design uses common items between adjacent grade level tests to construct a developmental score scale. Figure 1 displays an example of the common item design where grade level tests are displayed in the rows and the item groups within those tests are in the columns. An item group is a set of items that are always administered together. For this design, about half of the items of a particular grade level test are in common with an adjacent grade level test. However, it is not necessary for all the items to be in common with adjacent grade level tests.

Scaling Test Design

The scaling test design uses a single test that covers the content for the entire span of grades. This scaling test is meant to be practical in length so that the entire test can be finished in one testing session. In addition to the scaling test, every student takes an appropriate grade level test. Figure 2 shows the scaling test design, where a scaling test is given to all grades and grade level tests have common items between adjacent grades. The scaling test creates the developmental score scale, and each grade level test is linked to the scaling test. Therefore, operationally, students need only take a single grade level test.

Equivalent Groups Design

In the equivalent groups design, students are randomly assigned to take an appropriate grade level test or an adjacent grade level test (higher or lower). The adjacent upper grade level test may be too challenging for the particular grade, causing floor

effects. Thus, students for a particular grade typically take a grade level test that is grade appropriate or lower. A version of the equivalent groups design is seen in Figure 3, where item groups in dashed boxes are appropriate for the adjacent lower grade and item groups in bold-lined boxes are grade appropriate. Students in each grade are randomly assigned to take one of the two grade level tests shown for their grade, creating equivalent groups. The developmental score scale is created by using a chaining process across grades where grade level tests are placed onto a common scale.

Complications with the Designs

As shown in Figures 2 and 3, the scaling test and equivalent groups designs can include a common item design to serve as a backup if the primary design is found to be flawed. If the potential pitfalls of each of the designs are known, a researcher can confirm the validity of the design. As mentioned before, for the equivalent groups design to be valid, the randomization process for each grade must create equivalent groups. However, if equivalence is found to be violated and there are no common items present in the design, the design as a whole fails and data may need to be recollected. For the scaling test design to be valid, the scaling test given to all grades must be challenging enough for the upper grades but not too hard for the lower grades. This may be a problem for certain content areas tied closely to school curriculum: if correctly answering an item depends largely on whether a student has been taught a certain concept, it will be difficult to build a well-functioning scaling test. This is why scaling tests are often used with content areas that follow the domain definition of growth. The domain definition of growth states that growth is tied less to curriculum and is more of a progression through a content area (Kolen & Brennan, 2004). The common item design uses the grade-to-grade definition of growth, where growth is more directly tied to the curriculum (Kolen & Brennan, 2004). A limitation can be over- or under-inflation of growth if the design (scaling test or common items) does not match the definition of growth for the content area.

The three designs have their pros and cons. The common item design is the easiest to administer. This ease is not the case with the other designs: the equivalent groups design must have spiraled forms within a grade, and the scaling test design requires two tests to be constructed and administered. The common item design puts a great deal of weight on the soundness of the common items, and these common items must be considered when a test is first constructed. Context effects can occur: if the position of an item is changed, the item's characteristics can change (Yen, 1980). Items at the end of a speeded test can appear to be more difficult than if those same items were presented at the beginning of the test (Yen, 1980) (see Yen (1980) for a more complete coverage of context effects). In the common item design shown, the item block positions change, as seen in Figure 1. An item block that appears at the end of the third grade test may appear at the beginning of the fourth grade test. In this design, if the test is greatly speeded for both grades, the amount of growth that is seen from third to fourth grade on item block B will be a combination of actual growth and the context effect of the items. These two effects are not possible to separate given the common item design.

and scaling test design (Andrews, 1995). Tong (2005) and Tong and Kolen (2007) used simulation techniques to study the common item and scaling test designs and found that both designs generally recovered the true underlying developmental score scale. These simulation results may suggest that the common item and scaling test designs do not greatly affect the resulting scale and that something else may be impacting the real data results.

Vertical Scaling Methods

Three vertical scaling methods are Hieronymus, Thurstone, and Item Response Theory (IRT). These methods can be used with any of the three data collection designs previously discussed (Kolen & Brennan, 2004). These three vertical scaling methods use an interim score scale. Once this interim score scale is created, the developmental score scale can then be transformed to standard scores. The process of creating the interim and standard developmental score scale for each of the scaling methods is further described.

Hieronymus Scaling Method

The Hieronymus scaling method was first developed by A.N. Hieronymus at the University of Iowa and was used to scale the Iowa Tests of Basic Skills, Forms 1 and 2 (Petersen et al., 1989). This method relies on the use of number correct scores or points of dichotomously or polytomously scored items. In Hieronymus scaling, the first step is to put the number correct scores of all students from all grade levels onto a single interim scale. When a scaling test design is used, the interim scale is created by rank ordering all students on the scaling test. When the common item design or equivalent groups design is used, the interim scale is created by linking all of the grade level tests. For all of the designs, the interim scale is then transformed to a developmental score scale that has the final desired properties. Some desired properties could be scale score distributions that increase in variability as grades increase, as seen in the Iowa Tests (Hoover, Dunbar, & Frisbie, 2003). A final step is only needed when the scaling test design is used. This final step links the grade level tests to the scaling test.

Thurstone Scaling Method

The Thurstone scaling method was first developed by Thurstone in 1925 (Thurstone, 1925), modified in 1938 (Thurstone, 1938), and further modified in 1950 by Gulliksen (Gulliksen, 1950, p. 284). In the 1960s and 1970s the Thurstone scaling method was the predominant vertical scaling method and was used by testing companies such as CTB/McGraw-Hill, The Psychological Corporation, and Science Research Associates (Yen, 1986). The Thurstone scaling method, also referred to as Thurstone's absolute scaling method, makes the assumption that number correct scores within each grade level and over all grades are normally distributed (Thurstone, 1938; Gulliksen, 1950, p. 284). The Thurstone scaling method consists of three main steps (Kolen & Brennan, 2004). The first step specifies the mean and standard deviation of a normal distribution for each grade. The second step transforms all raw scores to normalized scores for each grade level frequency distribution. The final step links the first and second steps by finding the relationship between the raw scores and scale scores using normalizing transformations. The end result is a developmental score scale where each grade is normally distributed and all grades are linearly related.

Item Response Theory Scaling Method

The Item Response Theory scaling method uses calibration to place a test's item and ability parameters onto a single developmental score scale. A single calibration (concurrent calibration) or many calibrations (separate calibrations) with linear transformations place item and ability parameters on the same scale. This scale may be used as the final developmental score scale, or additional transformations may be performed to create the final developmental score scale.
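When separate calibrations are used, the linear transformation between two adjacent grade calibrations is typically computed from the parameter estimates of the common items. As an illustration only (the choice of transformation method is one of the methodological decisions listed in Chapter I, and the mean/sigma approach shown here is just one simple option with hypothetical values), the sketch below computes slope and intercept constants from common-item difficulty estimates and rescales parameters from the new grade onto the base scale.

```python
import numpy as np

def mean_sigma_constants(b_base, b_new):
    """Slope (A) and intercept (B) placing the new grade's theta scale onto
    the base grade's scale, from common-item difficulty estimates."""
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

def rescale(theta, a, b, A, B):
    """Apply the linear transformation to ability and item parameters
    (pseudo-guessing parameters are unchanged by a linear rescaling)."""
    theta, a, b = map(np.asarray, (theta, a, b))
    return A * theta + B, a / A, A * b + B

# Hypothetical common-item difficulty estimates from two separate calibrations
b_common_base = np.array([-0.40, 0.10, 0.55, 1.20])   # base-grade scale
b_common_new = np.array([-1.15, -0.55, -0.20, 0.45])   # adjacent-grade scale

A, B = mean_sigma_constants(b_common_base, b_common_new)
theta_star, a_star, b_star = rescale([-0.5, 0.0, 0.8], [1.1, 0.9, 1.3], [-0.2, 0.3, 0.9], A, B)
print(f"A = {A:.3f}, B = {B:.3f}")
```

Chaining such transformations across adjacent grades is what places every grade level test onto the common developmental scale under separate calibration.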

Item Response Theory Models

An IRT model's mathematical form relates the probability of obtaining a correct response with an examinee's level of ability (Lord, 1980). The items must be scrutinized to determine which model appropriately fits the data. Dichotomously (scored as either correct or incorrect) and polytomously (scored on a scale of correct to incorrect) scored items change the form of the model. The occurrence of local item dependence, multidimensionality, and/or the number of specified parameters all can affect the form of the model. For the three-parameter logistic (3PL) UIRT model, the probability of person i correctly answering dichotomously scored item j is

$$P_j(\theta_i) = c_j + (1 - c_j)\frac{1}{1 + \exp[-D a_j(\theta_i - b_j)]}, \qquad (5)$$

where D is a constant typically set equal to 1.7, $a_j$ is the item discrimination parameter, $b_j$ is the item difficulty parameter, $c_j$ is the item pseudo-guessing parameter, and $\theta_i$ is the single ability parameter. The $a_j$ parameter is an index of item discrimination; larger values indicate that the item more precisely differentiates between persons who will correctly or incorrectly answer the item. The $b_j$ parameter is an index of item difficulty, where a small value indicates the item is easy and a large value indicates the item is difficult. The $c_j$ parameter is an index of pseudo-guessing, which is interpreted as the probability of a person with an infinitely low ability getting the item correct by purely guessing. The item format typically dictates whether a parameter appears in the model. For example, if the item is a multiple-choice item, it is difficult to argue that guessing does not occur. If the item requires a short answer, it would be unlikely that the student would correctly answer the item by chance, and a pseudo-guessing parameter might not be included. These three parameters differentiate the 1PL/Rasch, 2PL, and 3PL models. The 3PL model contains all three of the previously mentioned parameters. The 2PL model can be thought of as a special case of the 3PL model where the pseudo-guessing parameter is set equal to zero for all items (de Ayala, 2009). The 1PL/Rasch model is a special case of the 3PL model where the item discrimination parameter is set equal to one (Rasch) or some constant (1PL) and the pseudo-guessing parameter is set equal to zero for all items (de Ayala, 2009).
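As a numerical illustration with hypothetical parameter values, consider an item with $a_j = 1$, $b_j = 1$, and $c_j = 0.2$. At an ability equal to the item difficulty, Equation 5 gives

$$P_j(\theta_i = b_j) = c_j + \frac{1 - c_j}{1 + \exp(0)} = 0.2 + \frac{0.8}{2} = 0.6,$$

so the response probability curve passes through $(1 + c_j)/2$ at $\theta_i = b_j$, which is the inflection-point property discussed with Figure 4 below.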

A fourth parameter, an index of slippage, has been proposed. This parameter is interpreted as the probability of a person with a very high ability getting the item incorrect by accident (memory problem, mis-answered item, etc.) (Barton & Lord, 1981; McDonald, 1967). This parameter has not been widely studied and is not examined here further.

Unidimensional Model

The mathematical forms (3PL, 2PL, and 1PL/Rasch) use a model to relate the item and ability parameters to a specific response function. The UIRT model for dichotomously scored data was presented in Equation 1 in Chapter 1. Figure 4 gives an illustration of the 3PL UIRT model to further describe the parameters. The item pseudo-guessing parameter, $c_j$, is equal to 0.2, which means an examinee with an infinitely low ability has a 20 percent chance of correctly answering the item. The item difficulty parameter, $b_j$, corresponds to the ability at the inflection point of the response probability curve (Lord, 1980). The inflection point occurs at the $(1 + c_j)/2$ probability. For this item, the inflection point occurs when the probability is 0.6, which corresponds to an ability of one. The item discrimination parameter, $a_j$, corresponds to the slope $[0.425\,a_j(1 - c_j)]$ of the response probability curve at the point of inflection (Lord, 1980). This item has a slope of 0.34 at the inflection point (an ability of one, where the probability is 0.6).

Testlet Model

When specifying the IRT model, the assumptions of the model must be considered. As seen with the UIRT model, the assumptions of local item independence and unidimensionality must be met. When items are grouped into testlets (e.g., on a reading test where there are common passages for multiple items (Yen & Fitzpatrick, 2006)), the assumption of local item independence is most likely violated. A testlet model may be more appropriate than a unidimensional model. When the local item independence assumption of the 3PL UIRT model is relaxed, the data can be modeled by the 3PL TIRT model.
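One common parameterization of the 3PL TIRT model, given here as a sketch following Wainer, Bradlow, and Wang (2007), augments Equation 5 with a testlet effect $\gamma_{i d(j)}$ for person i on the testlet d(j) that contains item j:

$$P_j(\theta_i) = c_j + (1 - c_j)\frac{1}{1 + \exp\!\bigl[-D a_j\bigl(\theta_i - b_j - \gamma_{i d(j)}\bigr)\bigr]}.$$

Under this form, responses to items within the same testlet remain dependent after conditioning on $\theta_i$ alone, and the variance of the testlet effects indexes the amount of local item dependence.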


More information

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review Results & Statistics: Description and Correlation The description and presentation of results involves a number of topics. These include scales of measurement, descriptive statistics used to summarize

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

Using the Score-based Testlet Method to Handle Local Item Dependence

Using the Score-based Testlet Method to Handle Local Item Dependence Using the Score-based Testlet Method to Handle Local Item Dependence Author: Wei Tao Persistent link: http://hdl.handle.net/2345/1363 This work is posted on escholarship@bc, Boston College University Libraries.

More information

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri

THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN By Moatasim A. Barri B.S., King Abdul Aziz University M.S.Ed., The University of Kansas Ph.D.,

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

Gender-Based Differential Item Performance in English Usage Items

Gender-Based Differential Item Performance in English Usage Items A C T Research Report Series 89-6 Gender-Based Differential Item Performance in English Usage Items Catherine J. Welch Allen E. Doolittle August 1989 For additional copies write: ACT Research Report Series

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Exploring dimensionality of scores for mixedformat

Exploring dimensionality of scores for mixedformat University of Iowa Iowa Research Online Theses and Dissertations Summer 2016 Exploring dimensionality of scores for mixedformat tests Mengyao Zhang University of Iowa Copyright 2016 Mengyao Zhang This

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS By OU ZHANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

Effects of Local Item Dependence

Effects of Local Item Dependence Effects of Local Item Dependence on the Fit and Equating Performance of the Three-Parameter Logistic Model Wendy M. Yen CTB/McGraw-Hill Unidimensional item response theory (IRT) has become widely used

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ Linking Mixed-Format Tests Using Multiple Choice Anchors Michael E. Walker Sooyeon Kim ETS, Princeton, NJ Paper presented at the annual meeting of the American Educational Research Association (AERA) and

More information

Discrimination Weighting on a Multiple Choice Exam

Discrimination Weighting on a Multiple Choice Exam Proceedings of the Iowa Academy of Science Volume 75 Annual Issue Article 44 1968 Discrimination Weighting on a Multiple Choice Exam Timothy J. Gannon Loras College Thomas Sannito Loras College Copyright

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

Copyright. Kelly Diane Brune

Copyright. Kelly Diane Brune Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

More information

IRT Parameter Estimates

IRT Parameter Estimates An Examination of the Characteristics of Unidimensional IRT Parameter Estimates Derived From Two-Dimensional Data Timothy N. Ansley and Robert A. Forsyth The University of Iowa The purpose of this investigation

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS LAINE P. BRADSHAW

COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS LAINE P. BRADSHAW COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS by LAINE P. BRADSHAW (Under the Direction of Jonathan Templin and Karen Samuelsen) ABSTRACT

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

A Multilevel Testlet Model for Dual Local Dependence

A Multilevel Testlet Model for Dual Local Dependence Journal of Educational Measurement Spring 2012, Vol. 49, No. 1, pp. 82 100 A Multilevel Testlet Model for Dual Local Dependence Hong Jiao University of Maryland Akihito Kamata University of Oregon Shudong

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

Comprehensive Statistical Analysis of a Mathematics Placement Test

Comprehensive Statistical Analysis of a Mathematics Placement Test Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

AN EXPLORATORY STUDY OF LEADER-MEMBER EXCHANGE IN CHINA, AND THE ROLE OF GUANXI IN THE LMX PROCESS

AN EXPLORATORY STUDY OF LEADER-MEMBER EXCHANGE IN CHINA, AND THE ROLE OF GUANXI IN THE LMX PROCESS UNIVERSITY OF SOUTHERN QUEENSLAND AN EXPLORATORY STUDY OF LEADER-MEMBER EXCHANGE IN CHINA, AND THE ROLE OF GUANXI IN THE LMX PROCESS A Dissertation submitted by Gwenda Latham, MBA For the award of Doctor

More information

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS By Jing-Ru Xu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements

More information

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design APPLIED MEASUREMENT IN EDUCATION, 22: 38 6, 29 Copyright Taylor & Francis Group, LLC ISSN: 895-7347 print / 1532-4818 online DOI: 1.18/89573482558342 Item Position and Item Difficulty Change in an IRT-Based

More information

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock 1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots Correlational Research Stephen E. Brock, Ph.D., NCSP California State University, Sacramento 1 Correlational Research A quantitative methodology used to determine whether, and to what degree, a relationship

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

LOGISTIC APPROXIMATIONS OF MARGINAL TRACE LINES FOR BIFACTOR ITEM RESPONSE THEORY MODELS. Brian Dale Stucky

LOGISTIC APPROXIMATIONS OF MARGINAL TRACE LINES FOR BIFACTOR ITEM RESPONSE THEORY MODELS. Brian Dale Stucky LOGISTIC APPROXIMATIONS OF MARGINAL TRACE LINES FOR BIFACTOR ITEM RESPONSE THEORY MODELS Brian Dale Stucky A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp.

LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. Traditional test development focused on one purpose of the test, either ranking test-takers

More information

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL) EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational

More information

March 2007 SEC STATE PLANS. (b) ACADEMIC STANDARDS, ACADEMIC ASSESSMENTS, AND ACCOUNTABILITY. - (2) ACCOUNTABILITY. -

March 2007 SEC STATE PLANS. (b) ACADEMIC STANDARDS, ACADEMIC ASSESSMENTS, AND ACCOUNTABILITY. - (2) ACCOUNTABILITY. - Proposed Legislative Language for the No Child Left Behind Reauthorization Submitted by the Conference of Educational Administrators of Schools and Programs for the Deaf (CEASD) March 2007 In order for

More information

Validity refers to the accuracy of a measure. A measurement is valid when it measures what it is suppose to measure and performs the functions that

Validity refers to the accuracy of a measure. A measurement is valid when it measures what it is suppose to measure and performs the functions that Validity refers to the accuracy of a measure. A measurement is valid when it measures what it is suppose to measure and performs the functions that it purports to perform. Does an indicator accurately

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS

EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS DePaul University INTRODUCTION TO ITEM ANALYSIS: EVALUATING AND IMPROVING MULTIPLE CHOICE QUESTIONS Ivan Hernandez, PhD OVERVIEW What is Item Analysis? Overview Benefits of Item Analysis Applications Main

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

Impact of Methods of Scoring Omitted Responses on Achievement Gaps

Impact of Methods of Scoring Omitted Responses on Achievement Gaps Impact of Methods of Scoring Omitted Responses on Achievement Gaps Dr. Nathaniel J. S. Brown (nathaniel.js.brown@bc.edu)! Educational Research, Evaluation, and Measurement, Boston College! Dr. Dubravka

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Advanced Dental Admission Test (ADAT) Official Results: 2017

Advanced Dental Admission Test (ADAT) Official Results: 2017 Advanced Dental Admission Test (ADAT) Official Results: 2017 Normative Period: April 3 through August 31, 2017 Number of ADAT Candidates: 481 Number of ADAT Administrations: 483 Report Date: September

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information