POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS


POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By

OU ZHANG

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA
2010

© 2010 Ou Zhang

To my Dad, who has supported me, believed in me, and encouraged me to start this long journey. He is my hero!

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to Dr. M. David Miller, my committee chair, for providing valuable guidance and continuous support. I would also like to thank Dr. James J. Algina, my committee member, for sharing his ideas and corrections on this project. My deepest gratitude goes to my parents and my wife, Bei Li, for their constant support and love. Thanks to my summer internship mentor, Dr. Feiming Li, and to Vice President Dr. Linjun Shen for giving me such a valuable opportunity to enter the educational measurement industry. Thanks to my friend Yan Cao for her patience and help over the years. Last, thanks go to Dr. Andrich for his comments and suggestions.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 7
ABSTRACT ... 9

CHAPTER
1 INTRODUCTION
  1.1 Model Selection
  1.2 Survey of the Testlet Size in Applications of Testlet
  1.3 Purpose of the Study

2 LITERATURE REVIEW
  2.1 Item Response Theory
    IRT Assumptions
    One-Parameter Logistic Model (1-PL Model or Rasch Model)
    Polytomous Item Response Theory (IRT) Model: Partial Credit Model
    Testlet Model: Rasch Testlet Model
    Local Item Dependence
  2.2 Reliability
  2.3 Survey in Application of Testlet

3 METHOD
  3.1 Model Used to Generate Data
  3.2 Population Parameters
  3.3 Condition Manipulated
  3.4 Data Generation
  3.5 Parameter Estimation
  3.6 Ability Estimation
  3.7 Analysis
    Bias
    Root Mean Square Error (RMSE)
    Reliability

4 RESULTS
  4.1 MLE Non-convergence Issue
  4.2 Test Reliability
  4.3 Standard Error of Measurement
  4.4 Bias and RMSE
  4.5 An Empirical Case

5 DISCUSSION
  General Discussion
  Limitations and Suggestions for Future Research
  Conclusion

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Testlet size in the article reviews
2-1 The number of testlets in the dataset
2-2 Test length in the reviewed articles
2-3 Sample sizes in the reviewed articles
2-4 Fit indices in reviewed articles
2-5 Estimation method in reviewed articles
2-7 The number of simulation replications applied in the reviewed articles
3-1 Study design conditions with 3 factors
4-1 MLE nonconvergence cases and rate per condition (testlet size 3)
4-2 MLE nonconvergence cases and rate per condition (testlet size 5)
4-3 Test reliability (testlet size 3 conditions)
4-4 Test reliability (testlet size 5 conditions)
4-5 Testlet size 3: results of the Spearman-Brown prophecy
4-6 Testlet size 5: results of the Spearman-Brown prophecy
4-7 Mean standard error of measurement for each condition (testlet size 3)
4-8 Mean standard error of measurement for each condition (testlet size 5)
4-9 Testlet size 3: bias and RMSE of ability estimate recovery (EAP)
4-10 Testlet size 5: bias and RMSE of ability estimate recovery (EAP)
4-11 Rasch testlet model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-12 Partial credit model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-13 Standard Rasch model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-14 Rasch testlet model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-15 Partial credit model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-16 Standard Rasch model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-17 Rasch testlet model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-18 Partial credit model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-19 Standard Rasch model (testlet size 3): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-20 Rasch testlet model (testlet size 5): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-21 Partial credit model (testlet size 5): RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-22 Standard Rasch model (testlet size 5): RMSE of ability ($\theta$) estimate recovery with 6 different ability intervals
4-23 NBOME Level-2 Block 1 item WMSE
4-24 COMLEX Level-2 Block 1 local item dependence detection results

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Arts in Education

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By Ou Zhang

December 2010

Chair: M. David Miller
Major: Research and Evaluation Methodology

This study investigated the effectiveness of ability parameter recovery for three models in order to detect the influence of local item dependence across testlet items in small testlet size situations. A simulation study was used to compare three Rasch-type models: the standard Rasch model, the Partial Credit model, and the Rasch testlet model. The results revealed that both the Partial Credit model and the Rasch testlet model performed better than the standard Rasch model in the presence of local item dependence within testlets. The results also indicated that as the sample size increases, the discrepancies between the model estimates and the real data set increase. The study concluded that using a polytomous IRT model for testlet item analyses remains efficient for small testlet sizes and non-adaptive tests. Moreover, for small testlet sizes, polytomous IRT models are more stable than the Rasch testlet model when a large number of testlets is included in a test. In sum, the polytomous IRT model and the Rasch testlet model offer an advantage over the standard Rasch model because they avoid underestimating the standard error of measurement and provide better ability parameter estimates in small testlet size situations.

CHAPTER 1
INTRODUCTION

Item response theory (IRT) models are commonly used in educational and psychological testing. Employing item response theory allows for assessing latent human characteristics and quantifying underlying traits. IRT rests on a major assumption: local item independence. Local item independence (LID) assumes that items in the test are unrelated to each other after controlling for the underlying trait. However, the LID assumption is commonly violated in real-world applications. In fact, many real-world tasks require solving related problems or solving a single problem in stepwise fashion. To reflect such circumstances, an exam may include a subset of items sharing a single content stimulus. The items sharing the same stimulus are grouped as a unit, termed an item bundle (Rosenbaum, 1988) or testlet (Wainer & Kiely, 1987). An item bundle or testlet, henceforward referred to as a testlet, is a scoring unit within a test that is smaller than the test itself (Wainer & Kiely, 1987). Items within testlets are locally dependent because they are associated with the same stimulus. Moreover, local item dependence introduces unintended dimensions into the test at the expense of the construct of interest (Wainer & Thissen, 1996). Thus, the challenge for the test developer is not to eliminate the item dependencies, but rather to find a proper solution so that such local item dependence does not impact the test reliability and the validity of inferences from the test. More specifically, violation of the local item independence assumption may lead to an underestimate of the standard errors and could result in (a) bias in item difficulty estimates, (b) inflated item discrimination estimates, (c) overestimation of the precision of examinee scores, and (d) overestimation of test

reliability and test information. This last result can lead to inaccurate inferences and a greater chance of misclassification when making decisions regarding examinee ability categorization (Sireci, Thissen, & Wainer, 1991; Yen, 1993). Therefore, several models have been proposed as solutions to the violation of the local item independence assumption.

One method is to treat the items of a testlet as a single super polytomous item in the analysis (Sireci, Thissen, & Wainer, 1991; Thissen, Steinberg, & Mooney, 1989; Wainer, 1995). This method leans heavily on Rosenbaum's theorem of item bundles (Rosenbaum, 1988), using a polytomous IRT model to score the testlets, which are locally independent of one another. The key idea is that the items that form each testlet may have excessive local dependence, but once the entire testlet is treated as a single unit and scored polytomously, these local dependencies may disappear. The item scores are summed within each testlet, and responses with identical total scores in a testlet are assigned to the same category. This method allows researchers to score testlets polytomously. Once the summed item scores are obtained, the testlet-type item responses are calibrated with a polytomous item response model, such as the Graded Response Model (Samejima, 1969), the Partial Credit Model (Masters, 1982), the Rating Scale Model (Andrich, 1978), or the Nominal Response Model (Bock, 1972). By using a polytomous IRT model to score testlets, the data can be analyzed while maintaining local independence across different testlets. This approach avoids the overestimation of test reliability and information, so the statistics of the polytomous IRT model consistently perform better than those of the standard Rasch model in such circumstances.
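As an illustration of this scoring step, the following minimal R sketch collapses dichotomous testlet items into polytomous super-item scores by summing within each testlet; the response matrix, testlet assignment, and all object names are invented for the example.

```r
# Illustrative sketch: collapse dichotomous testlet items into
# polytomous "super-items" by summing item scores within each testlet.
# 'resp' is an examinee-by-item 0/1 matrix; 'testlet_id' assigns each
# item to a testlet (NA marks an independent item).
set.seed(123)
resp <- matrix(rbinom(20 * 9, 1, 0.6), nrow = 20, ncol = 9)
testlet_id <- c(1, 1, 1, 2, 2, 2, NA, NA, NA)   # two 3-item testlets

# A 3-item testlet becomes one polytomous item with categories 0-3.
testlet_scores <- sapply(unique(na.omit(testlet_id)), function(d) {
  rowSums(resp[, which(testlet_id == d), drop = FALSE])
})
colnames(testlet_scores) <- paste0("testlet", unique(na.omit(testlet_id)))

# Independent items are kept as ordinary dichotomous items.
scored <- cbind(testlet_scores, resp[, is.na(testlet_id), drop = FALSE])
```

Note that summing discards the within-testlet response pattern: the patterns 110 and 011 both become the category score 2, which is exactly the information loss discussed below.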

However, this approach has some weaknesses when it is applied to testlet-type data sets. Several major shortcomings of polytomous IRT models have been discussed (Thissen, Billeaud, McLeod, & Nelson, 1997; Yen, 1993; Wainer & Wang, 2000). First, when polytomous IRT models are applied, some test information (the precise pattern of responses the examinee generates) is lost. Second, some parameters are dropped from the polytomous model compared to individual dichotomous item scoring. Third, the approach is inappropriate if the test is administered adaptively. Last but not least, the test reliability might be underestimated (Yen, 1993). Wainer (1995) claimed that using a polytomous IRT model to handle testlets might be appropriate when the local dependence between items within a testlet is moderate and the testlet-type items make up only a small proportion of the entire test.

The other method, the testlet model (Wainer & Kiely, 1987), was explicitly introduced as an alternative to the polytomous IRT model and attempts to solve the same problem. IRT testlet models have been proposed in which a random effect parameter is added to model the local dependence among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). As a random effect parameter is added to the model, an additional latent trait is also added, so the testlet model proposed by Wainer and Wang (2000) is a special case of a multidimensional IRT (MIRT) model. The Rasch testlet model, proposed by Wang and Wilson (2005b), is a special case of this testlet model that combines features of the Rasch model and the testlet model, and in so doing makes use of several desirable measurement and psychometric properties of the Rasch model (Wang & Wilson, 2005). First, the

Rasch model has observable sufficient statistics for the model parameters and a relatively small sample size requirement for parameter estimation. Second, no distributional assumption on the item parameters is necessary in Rasch models, since the items are treated as fixed effects. Therefore, the Rasch model is widely applied in testing and scoring. Because of these advantages, Wang and Wilson (2005a, 2005b) showed that it is possible to model locally dependent items within testlets by using a Rasch testlet model, so that more precise and adequate estimates are obtained.

Before the testlet model was proposed, polytomous IRT models were the primary method for analyzing testlets. Currently, both approaches are widely used for testlet analyses, and both have pros and cons. Thus, the theoretical appeal of choosing the testlet model (Wainer & Wang, 2000) over a polytomous IRT model for testlet analysis might seem obvious. However, some potential caveats of the testlet model should also be considered. First, the testlet model is more complex than both the standard IRT model and the polytomous IRT model because it adds testlet parameters. Second, when a testlet parameter is added to the model, an additional latent trait is also added, so multidimensionality occurs and increases the complexity of the analysis. Thus, the model analysis process is greatly prolonged, and some potential issues emerge (e.g., the model calibration sometimes fails to converge). Therefore, the benefits of using the testlet model (Wainer & Kiely, 1987) should be weighed against the added complexity in data analysis.

1.1 Model Selection

Although Wainer and Wang (2000) addressed the advantages of the testlet model over the polytomous model in applied testlet analyses, it remains important to compare these two models under various conditions. Pitt, Kim, and Myung (2003) indicated that the goal of model selection is not just to find the model that provides the maximum fit to a given data set, but to identify, from a set of competing models, the model that best captures the characteristics or trends underlying the cognitive process of interest. Briefly, the best model is the one that matches the purpose of the study and can explain all of the important features of the actual data without adding unnecessary complexity. The model's analysis efficiency is another issue that must be considered.

Two realistic circumstances should be noted in model selection for analyzing testlets. The first is testlet size. In previous testlet research, in order to obtain illustrative results to support hypotheses, testlet sizes were usually set from 5 to 10 or more items (e.g., Adams, Wilson & Wang, 1997; Wang & Wilson, 2005; Brandt, 2008; Wainer & Wang, 2000; Wainer & Lewis, 1990). Small and medium testlet sizes (2-4 items) were rarely applied (Ip, Smits & De Boeck, 2009; Tokar, Fischer, Snell & Harik-Williams, 1999; DeMars, 2006). This is potentially problematic because in some exams, like the National Board of Osteopathic Medical Examiners' (NBOME) Comprehensive Osteopathic Medical Licensing Examination (COMLEX-USA), testlet sizes are often small. The second issue to consider is that of non-adaptive tests. Non-adaptive tests are still widely used in the educational and psychological measurement field. Because of the small testlet sizes and non-adaptive features, the loss of response-pattern information is not that serious for such tests. In this light, more attention should be given to the

comparison between the two models in situations where the aforementioned shortcomings of applying a polytomous IRT model to testlet analysis are minimal. In addition, the local dependence effect ($\sigma^2_{d(i)}$) within testlets varies. Since the local dependence effect is avoided when polytomous IRT models are applied, the extent to which it influences the fit of the polytomous model in testlet analysis should also draw attention.

1.2 Survey of the Testlet Size in Applications of Testlet

Very little research has focused on model comparison between the polytomous IRT model and the testlet model initially proposed by Wainer and Wang (2000) regarding model fit, ability parameter recovery, and test reliability as testlet conditions change, especially when testlet size and the local dependence effect ($\sigma^2_{d(i)}$) are at a medium level. A review of the literature was conducted in the EBSCO Host and PsychInfo databases with the keywords testlet and testlets to identify studies that included testlet characteristics between 1989 and 2009. A total of fifty-five articles relevant to testlets were found and reviewed (see the reference list in Appendix B). Among all fifty-five testlet-related articles, forty-five have specific descriptions of the factors that could influence the testlet analysis in testlet research designs (i.e., testlet size, the number of testlets within a test, sample size, etc.). The remaining ten articles, which include two book reviews, conceptually describe testlet theory and application.

Issues of testlet size have been well documented in the literature. Of these forty-five testlet-relevant articles, only four solely applied small testlet size designs (i.e., testlet sizes smaller than five). The other forty-one articles have a

mixture of testlet size designs, although most of the articles included moderate and large testlet size designs (testlet sizes larger than 5) in their research. Of these forty-one, twelve articles considered small testlet size designs, thirty-five included testlet sizes between 5 and 10, and twelve included large testlet size conditions (larger than 10). Overall, 16 articles (35.6%) investigated small testlets, and only 12 compared small and medium testlet sizes. The detailed results are shown in Table 1-1.

In sum, this study adds to this literature by investigating the results of three different models applied to testlet-type data under small and medium testlet size circumstances (i.e., testlet sizes smaller than or equal to 5). Testlet size, local dependence effect, sample size, and the ratio of testlet to independent items are the factors in this study. We examine model fit, test reliability, and the ability parameter recovery of the three different models (i.e., the standard Rasch model, the Partial Credit model, and the Rasch testlet model) employed in testlet-type data analysis.

1.3 Purpose of the Study

In accordance with previous testlet research, one purpose of this study is to explore the consequences of variation in testlet size and local dependence effects on the test reliability, standard error of measurement, and ability parameter recovery of the standard Rasch model, the Partial Credit model, and the Rasch testlet model. By looking for trends in how changes in testlet factors (i.e., testlet size, local dependence effect, sample size, testlet/independent item ratio) affect the different models' estimates and the corresponding test reliabilities, a guide for model selection is expected to emerge.

The other essential goal of this study is to determine which model performs best at person ability parameter recovery, considering the trade-off between test reliability and analysis complexity. Answers to these questions will provide useful evidence and a reference for researchers interested in applying IRT models to tests appropriately. Furthermore, since we use data from the NBOME COMLEX-USA examination, the study will provide guidance for future improvements in the estimation of this exam.

Table 1-1. Testlet size in the article reviews

Testlet size (m):  m < 5   | 5 < m < 10 | 11 < m < 15 | 16 < m < 20 | 21 < m < 25 | m > 25
Articles:          16      | 35         | 6           | 2           | 3           | 1
Proportion:        35.56%  | 77.78%     | 13.33%      | 4.44%       | 6.67%       | 2.22%

Note: m is the number of items in the testlet. Counts are out of the 45 articles with explicit testlet designs; proportions sum to more than 100% because many articles used more than one testlet size.

CHAPTER 2
LITERATURE REVIEW

In this chapter, the theoretical framework of this research is given. Several important parts are included: IRT theory, IRT assumptions, the IRT models used in this research, local item dependence, and test reliability.

2.1 Item Response Theory

Item Response Theory (IRT), proposed by Lord (1952), is a family of statistical models for analyzing item responses in a population of individuals. It depicts the relationship between examinees and items through mathematical models (Wainer & Mislevy, 2000). Many mathematical models can be developed within the IRT framework. There are two general types of IRT models: dichotomous and polytomous. Dichotomous IRT models are used for items with only correct or incorrect response options; the One-Parameter Logistic (1PL), Two-Parameter Logistic (2PL), and Three-Parameter Logistic (3PL) models are three common dichotomous IRT models. Items with more than two response options can be modeled with polytomous IRT models, examples of which include the Graded Response model (GRM; Samejima, 1969), the Rating Scale model (RSM; Andrich, 1978), the Partial Credit model (PCM; Masters, 1982), the generalized Partial Credit model (GPCM; Muraki, 1992), and the Nominal Response model (NRM; Bock, 1972). The notable feature of IRT over classical test theory is the invariance of item and ability parameters (Hambleton, Swaminathan & Rogers, 1991). According to this invariance property, item parameters (e.g., difficulty, discrimination, and guessing) are not dependent on the

ability distribution of any particular group of examinees, and the examinee ability parameters ($\theta$s) are not dependent on a specific set of test items.

IRT Assumptions
Two essential a priori assumptions are held by Item Response Theory. The first assumption of IRT is local item independence: the probability of a correct response to one item is independent of the responses to other items, for a given value of the latent trait. Consequently, the joint probability of a response pattern for all items in the test is the product of the probabilities of correct responses to the items for a given latent trait:

$$P(X_1, X_2, \ldots, X_N \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta) \qquad (2-1)$$

where N is the total number of items.

The second assumption of most general IRT models (e.g., 1PL, 2PL, and 3PL models) is unidimensionality. Early notions of IRT require that the same construct be measured by all test items (Loevinger, 1947). As such, all items in the test measure only a single latent trait (Hambleton & Murray, 1983; Lord, 1980).
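As a small numeric illustration of Equation 2-1 (all probability values are invented): for a fixed latent trait, the probability of a whole response pattern is simply the product of the item-level probabilities.

```r
# Local independence (Eq. 2-1): for fixed theta, the pattern
# probability factors into a product over items. Values are invented.
p_items <- c(0.8, 0.6, 0.4)    # P(X_i = 1 | theta) for three items
pattern <- c(1, 1, 0)          # an observed response pattern
prod(ifelse(pattern == 1, p_items, 1 - p_items))  # 0.8 * 0.6 * 0.6 = 0.288
```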

One-Parameter Logistic Model (1-PL Model or Rasch Model)
The Rasch model (Rasch, 1960) is the simplest of the unidimensional models. It predicts the probability of success for person j on item i and is given by the formula:

$$P(y_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)} \qquad (2-2)$$

where
$y_{ij}$ is examinee j's response to item i;
$\theta_j$ is examinee j's proficiency level;
$b_i$ is the difficulty parameter of item i, which indicates the point on the ability continuum at which an examinee has a 50% probability of answering item i correctly; and
$P(y_{ij} = 1 \mid \theta_j, b_i)$ is the probability that examinee j answers item i correctly, given proficiency level $\theta_j$.
An assumption implicit in the model is that all items have the same discrimination value.

Polytomous Item Response Theory (IRT) Model: Partial Credit Model
In this study, for comparison with the standard Rasch model, the Partial Credit model was selected as the polytomous IRT model. The Partial Credit model (PCM; Masters, 1982) was originally developed for analyzing test items that require multiple steps and for which it is important to assign partial credit for completing several steps in the solution process. The model is designed to be used when partial credit can be awarded for degrees of success. The PCM is a divide-by-total, or direct, IRT model. It can be considered an extension of the Rasch model and retains all the standard Rasch model features. The equation for the Partial Credit model is

$$P_{ix}(\theta_j) = \frac{\exp \sum_{k=0}^{x} (\theta_j - \delta_{ik})}{\sum_{r=0}^{m_i} \exp \sum_{k=0}^{r} (\theta_j - \delta_{ik})} \qquad (2-3)$$

where
item i is scored $x = 0, \ldots, m_i$ for an item with $K = m_i + 1$ response categories;

$\delta_{ik}$ ($k = 1, \ldots, m_i$) is called the item step difficulty, associated with a category score of k, and x is the response category of interest;
$\theta_j$ is examinee j's proficiency level; and
$P_{ix}(\theta_j)$ is the probability that examinee j responds to item i in category x, given proficiency level $\theta_j$;
with the identification convention

$$\sum_{k=0}^{0} (\theta_j - \delta_{ik}) \equiv 0. \qquad (2-4)$$

Testlet Model: Rasch Testlet Model
To model examinees' responses to testlet items, IRT testlet models have been proposed in which a random effect parameter is added to model the local dependence among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). Following this general approach, a simplified testlet model was formulated by Wang and Wilson (2005) and can be written as

$$P_{ji1} = \frac{\exp(\theta_j - b_i + \gamma_{d(i)j})}{1 + \exp(\theta_j - b_i + \gamma_{d(i)j})} \qquad (2-5)$$

where
$P_{ji1}$ is the probability that examinee j answers item i correctly (scoring 1);
$\theta_j \sim N(0, 1)$ is the ability of examinee j;
$b_i \sim N(\mu_b, \sigma_b^2)$ is the difficulty of item i; and
$\gamma_{d(i)j} \sim N(0, \sigma^2_{d(i)})$ is a random effect that represents the interaction of person j with testlet d(i) (i.e., the testlet d that contains item i).
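For concreteness, the three response functions (Equations 2-2, 2-3, and 2-5) can be written as short R functions; this is a hedged sketch, with all function names and test values invented for illustration.

```r
# Rasch model (Eq. 2-2): probability of a correct response.
p_rasch <- function(theta, b) exp(theta - b) / (1 + exp(theta - b))

# Partial Credit model (Eq. 2-3): category probabilities 0..m_i given
# step difficulties 'delta'; the x = 0 term in the sum is fixed at 0.
p_pcm <- function(theta, delta) {
  num <- exp(c(0, cumsum(theta - delta)))
  num / sum(num)
}

# Rasch testlet model (Eq. 2-5): adds the person-by-testlet random
# effect gamma_d(i)j to the Rasch kernel.
p_testlet <- function(theta, b, gamma) {
  exp(theta - b + gamma) / (1 + exp(theta - b + gamma))
}

p_rasch(0.5, -0.2)           # one dichotomous item
p_pcm(0.5, c(-1, 0, 1))      # one 4-category super-item
p_testlet(0.5, -0.2, 0.3)    # testlet item with gamma = 0.3
```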

Local Item Dependence
As mentioned before, local item independence (LID) is the first a priori assumption of IRT models. It means that the item responses are conditionally independent given the latent trait. Therefore, there should not be any correlation between two items after controlling for the underlying trait; the items should be correlated only through the latent trait that the test is measuring (Lord & Novick, 1968). However, this LID assumption is nearly always violated in real applications. Sometimes, significant correlations among items remain after controlling for the effect of the latent trait. Because of these significant correlations, the items are locally dependent, or there is a subsidiary dimension in the measurement that is not accounted for by the overarching trait. Locally dependent items are always a cause of information loss for IRT models (Chen & Thissen, 1997).

Several indices have been proposed to detect local item dependence for dichotomous item response models. Yen (1984, 1993) introduced the $Q_3$ statistic, comparing it with other traditional measures: $Q_1$ (Yen, 1981), $Q_2$ (Van den Wollenberg, 1982), and Signed $Q_2$ (Van den Wollenberg, 1982). The $Q_3$ statistic is the inter-item correlation between item pairs once the effect of the latent trait is removed. Although the $Q_3$ statistic has been commonly used for several years, it has two major deficiencies in applied settings. First, the $Q_3$ statistic requires a latent trait estimate prior to calculating the item-pair residual correlation. Second, the entire set of test data must be used to compute the $Q_3$ statistic.
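A minimal sketch of the $Q_3$ computation under the Rasch model, assuming a scored response matrix, ability estimates, and item difficulties are already available (all object names are illustrative):

```r
# Yen's Q3: inter-item correlations of residuals after removing the
# latent trait. 'resp' is persons x items (0/1), 'theta' holds ability
# estimates, and 'b' holds item difficulties; names are illustrative.
q3 <- function(resp, theta, b) {
  p <- plogis(outer(theta, b, "-"))   # model-implied probabilities
  resid <- resp - p                   # item residuals
  cor(resid)                          # Q3 matrix for all item pairs
}
```

Item pairs whose $Q_3$ values stand well above the other inter-item values flag possible local dependence, such as items sharing a testlet stimulus.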

Therefore, Chen and Thissen (1997) proposed four innovative LID indices computed from expected frequencies under IRT models. The calculation of these four local dependence indices uses a subset of items and does not require trait estimates. The four indices, each defined for a pair of items, are Pearson's $X^2$, the likelihood ratio $G^2$, the standardized coefficient difference, and the standardized log-odds ratio difference.

Ponocny (2001) proposed a general family of conditional nonparametric tests to detect differences between groups and items for Rasch models, including violations of items' local stochastic independence (e.g., $T_1$). By creating a two-by-two table for two items, the comparison can be subjected to the standard $\chi^2$ test or Fisher's exact test (Fischer, 1974). This test is able to detect differences in covariance between item pairs, so the local independence assumption can be tested via a suitable contingency table (Ponocny, 2001). The family of conditional nonparametric tests is built on statistics T(A) of the data matrix A, evaluated under the Rasch model likelihood

$$L(A \mid \theta, \sigma) = \frac{\exp\left(\sum_{v=1}^{n} r_v \theta_v - \sum_{i=1}^{k} s_i \sigma_i\right)}{\prod_{v=1}^{n} \prod_{i=1}^{k} \left[1 + \exp(\theta_v - \sigma_i)\right]} \qquad (2-6)$$

where
$\theta_v$ is the examinee's ability parameter;
$r_v$ is the examinee's raw score; and
$s_i$ is the item marginal sum for the item difficulty parameter $\sigma_i$ ($i = 1, \ldots, k$).
The random variable T is a sufficient statistic for the parameter which expresses a certain violation of the Rasch model (Ponocny, 2001). Based on the conditional nonparametric tests from Ponocny (2001), local item dependence is demonstrated by the inter-item correlation between item $I_i$ and item $I_j$ ($i \neq j$; $i, j \leq K$). The inter-item correlation is based on the $2 \times 2$ table obtained by counting the cases with equal responses

on both items (Ponocny, 2001). The statistic $T_1(A)$ is applied for local item dependence detection as follows:

$$T_1(A) = \sum_{v} \delta_{x_{vi} x_{vj}} \qquad (2-7)$$

where $\delta_{x_{vi} x_{vj}}$ denotes the Kronecker symbol, with $\delta_{x_{vi} x_{vj}} = 1$ for $x_{vi} = x_{vj}$ and $\delta_{x_{vi} x_{vj}} = 0$ otherwise. Then, a goodness-of-fit test is conducted to check the proportions of the correlation comparison between the model-implied estimates and the observed values from the matrix for two specific items. The sum of the $T_1$'s over the item pairs serves as a test statistic when two or more item pairs are investigated simultaneously (Ponocny, 2001). In the meantime, an overall test statistic ($T_{11}$) for the local dependence of the test is given by summing the absolute deviations of all inter-item correlations $r_{ij}$ in the test from their expected values $\rho_{ij}$. The test statistic $T_{11}$ is shown below (Ponocny, 2001):

$$T_{11}(A) = \sum_{i \neq j} \left| r_{ij} - \rho_{ij} \right| \qquad (2-8)$$

2.2 Reliability

In educational measurement, reliability is a statistical index used to quantify and evaluate the consistency of test scores. If the local item independence assumption is violated, the measurement errors are underestimated, giving an inflated reliability estimate. Circumstances in which the local item independence assumption is violated commonly occur with testlets. The test construct is subject to the impact of measurement errors that are not related to the latent traits the test intends to measure, and these measurement errors determine how reliably the test measures the construct. Test reliability has been consistently mentioned in previous testlet research. A concern about test reliability was expressed regarding the creation of super polytomous items to

handle testlets (Keller, Swaminathan, & Sireci, 2003). The approach that treats testlets as polytomous items may lose the information contained in the response pattern, so the measurement errors may increase and reduce the overall test reliability (Keller et al., 2003). In addition, compared to the original dichotomous items, some parameters are dropped when the polytomous items are formed, so the test reliability may decrease (Zenisky, Hambleton & Sireci, 2002). Yen (1993) also claimed that, when items are combined into testlet scores and some of the items within a testlet are locally dependent, the reliability will be underestimated. Thus, a comparison of the test reliabilities among the three models is especially necessary for model selection.

2.3 Survey in Application of Testlet

The review of the applied testlet literature in the EBSCO Host and PsychInfo databases also identified other possible factors that impact the application of testlet models. The testlet/independent item ratio within a test, in terms of the number of testlets, is another important factor in testlet research. Among the forty-one articles in which testlet numbers are specified, the general mean of the testlet numbers, including subconditions within each article, is 7.9. The largest testlet number design was fifty, which occurred in Wainer and Wang's (2000) article. One other study contained a large testlet number in its research design: Tokar, Fischer, Snell, and Harik-Williams (1999) included twenty testlets. Except for these two large testlet number designs, all the other articles (39) contained three to fifteen testlets (e.g., Wainer & Lewis, 1990; Thissen, Steinberg & Mooney, 1989; Wang, Cheng & Wilson, 2005; Wainer, 1995; Yang & Gao, 2008). This range gave clear guidance for this study's research design. The detailed

information on testlet numbers used in previous testlet studies is given in Table 2-1.

Based on the same literature review, forty-three out of fifty-five studies identified test lengths in their research designs. Across these forty-three articles, test length ranged from 13 to 899. The mean test length (64.74) was obtained by first removing the largest test length (i.e., 899, from Wainer and Wang's 2000 article), summing the remaining test lengths, and dividing by the number of articles in which test length was included in the design (Table 2-2).

The research sample size is the third factor that can influence the analysis of testlets. In the testlet application literature, thirty-seven articles identified the sample size. Since some studies used extremely large sample sizes (e.g., Brandt's 2008 article and Zenisky, Hambleton & Sireci's 2002 study, the latter with 8494 examinees), the median sample size of 681 may be more illustrative than the mean. The range of sample sizes provides a guideline for our research design. First, in twelve of the thirty-seven articles reviewed, researchers included sample sizes smaller than 500 (e.g., Adams, Wilson & Wang, 1997; Wang, 2005; Schmitt, 2002). Second, eighteen articles included sample sizes between 500 and 1000 (e.g., Adams, Wilson & Wang, 1997; Stark, Chernyshenko & Drasgow, 2004). Finally, twenty studies included sample sizes larger than 1000 (e.g., Brandt, 2008; Wainer & Wang, 2000; Thissen, Steinberg & Mooney, 1989). Table 2-3 details the information on sample sizes used in previous studies.

Seventeen studies used the RMSE and the loglikelihood ratio coefficient as evaluation criteria (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006;

Armstrong, 2004). The next most commonly used criteria were the reliability coefficient and bias (used by nine and five papers, respectively) (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006; Armstrong, 2004; Davis, 2003; Schmitt, 2002). Various other indices (e.g., AIC, WMSE, RMSEA, NNFI, CFI, GFI, Q3, RMS) were used in twenty studies (e.g., Gessaroli & Folske, 2002; Schmitt, 2002; Adams, Wilson & Wang, 1997). Clearly, most researchers relied on the RMSE and the loglikelihood ratio coefficient to compare model fit and parameter estimates. Table 2-4 gives detailed information on the fit criteria used.

Finally, the estimation methods were designated in twenty-nine studies. Twenty-four of these articles applied the Marginal Maximum Likelihood (MML) method (e.g., Lee, 2006; Wang & Wilson, 2005; Wainer, 1995). Only five articles used the Markov Chain Monte Carlo (MCMC) method (e.g., Li, 2005; Li, 2006; Wang, 2002; Wainer & Wang, 2000). The number of simulation replications was acknowledged in only eleven articles (e.g., Lee, 2000; Ip, Smits & De Boeck, 2009; Stark, Chernyshenko & Drasgow, 2004). Among these eleven articles, five applied 100 replications (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006) and only two articles applied more (200 and 600) (Li, 2006; Zwick, 2002). Tables 2-5 and 2-7 include detailed information on the estimation methods and replication counts for all the studies reviewed.

Table 2-1. The number of testlets in the dataset
(Columns: Articles; Testlet number; Testlet number mean/article. General mean of testlet numbers: 7.9.)

Table 2-2. Test length in the reviewed articles
(Columns: Articles; Test length; Mean length. General mean of test length: 64.74.)

Table 2-3. Sample sizes in the reviewed articles
(Columns: Article No.; sample sizes in paper, sets 1-6; sample size mean/paper; sample size < 500; 500 < sample size < 1000; sample size > 1000.)
Summary over the 37 articles reporting sample sizes: 12 articles (32.43%) included samples smaller than 500, 18 (48.65%) between 500 and 1000, and 20 (54.05%) larger than 1000; general median = 681.

Table 2-4. Fit indices in reviewed articles

Index:      Bias    | RMSE   | Reliability coefficient | Loglikelihood ratio test | WMSE  | AIC   | Other index
Articles:   5       | 17     | 9                       | 17                       | 2     | 1     | 20
Percentage: 11.90%  | 40.48% | 21.43%                  | 40.48%                   | 4.76% | 2.38% | 47.62%

Table 2-5. Estimation method in reviewed articles

Method:     MML     | MCMC   | Total
Articles:   24      | 5      | 29
Percentage: 82.76%  | 17.24%

Table 2-7. The number of simulation replications applied in the reviewed articles
Frequencies across the eleven articles reporting replications: 45.45%, 18.18%, 18.18%, 9.09% (five articles used 100 replications; one each used 200 and 600).

CHAPTER 3
METHOD

A comprehensive review of the testlet research from 1989 to 2009 provides a systematic framework for exploring the performance of three different IRT models for analyzing testlets. These three models will be part of the two studies presented in this paper. The first is a series of simulation studies designed to investigate the extent to which fluctuations in testlet conditions (testlet size, local dependence effects, etc.) influence the model-fitting results. Simulations are conducted to evaluate model fit, test reliability, and parameter recovery for the three IRT models. Next, a real data analysis of the COMLEX-USA exam dataset is presented, fitting the different models as an empirical case. The three one-parameter IRT models adopted in the study are the standard Rasch model, the Partial Credit model, and the Rasch testlet model.

3.1 Model Used to Generate Data

The current study evaluates the effect of changes in the local effect of testlets on the model fit, ability parameter recovery, and test reliability of three different IRT models. In order to quantify the extent of the local effect, the Rasch testlet model is an appropriate choice for simulating the research data. The Rasch testlet model (Wang & Wilson, 2005) includes a testlet parameter ($\gamma_{d(i)j}$), the random effect capturing the interaction of person j with testlet d(i) when the overarching latent trait is held constant. According to the definition of the testlet effect, the sum of the testlet parameters ($\gamma_{d(i)j}$) over examinees within any testlet is zero, with $\gamma_{d(i)j} \sim N(0, \sigma^2_{d(i)})$. Thus, the local effects of testlets in the Rasch testlet model are simulated from a normal distribution with a mean of zero and a standard deviation

equal to the square root of the given local effect value ($\sigma^2_{d(i)}$). With $v = 1, \ldots, V$ (V the total number of examinees) and $d = 1, \ldots, D$ (D the total number of testlets), the following prior model constraints are used to simulate the responses:

$$\sum_{d=1}^{D} \gamma_{vd(i)} = 0 \quad \text{for all } v = 1, \ldots, V \qquad (3-1)$$

$$\mathrm{cov}(\theta_v, \gamma_{vd(i)}) = 0 \quad \text{for all } d = 1, \ldots, D \qquad (3-2)$$

$$\mathrm{cov}(\gamma_{vd(i)}, \gamma_{vd(j)}) = 0 \quad \text{for all distinct testlets } d(i) \neq d(j) \qquad (3-3)$$

3.2 Population Parameters

Population item parameters for the Rasch testlet model and the population ability parameters are simulated from a normal distribution with a mean of zero and a standard deviation of one, within a range from negative three to positive three (i.e., $\theta_j \sim N(0, 1)$, $\theta_j \in [-3, 3]$). For each condition, the population item difficulty parameters are generated with a mean of zero and a standard deviation of one ($b_i \sim N(0, 1)$), within a range of $[-3, 3]$. For simplicity, all simulated population parameters are rounded to three decimal places. The population item parameters and population ability parameters are randomly drawn from these two normal distributions ahead of each condition.

3.3 Condition Manipulated

In this study, we examine whether fluctuations in testlet size, local dependence effects, and item difficulty within testlets affect the reliabilities and the model fit of three different IRT models. Our study is a four-factor completely crossed design: 2 (testlet sizes) x 4 (levels of local dependence effect) x 3 (ratios of testlet items to independent items) x 3 (sample sizes). Table 3-1 demonstrates all 72

conditions and the interactions of these four factors in the testlet research designs. The factor levels are as follows (a short sketch enumerating the crossed design follows the list):

1. The first factor is testlet size. The testlet sizes chosen for this study are based on the purpose of the study and on the sizes less often discussed in the applied literature. Thus, two testlet sizes, small and medium, are used in this study: (3, 5).

2. The second factor is the local dependence effect. Local dependence effects in the ten reviewed studies that reported them are within the range of zero to one (Wainer & Wang, 2000; Wang, 1999; Wang, 2002; Wang, 2005; Habing & Roussos, 2003; Adams, Wilson & Wang, 1997; Wang & Wilson, 2005; DeMars, 2006; Li, 2005; Zenisky, Hambleton & Sireci, 2002). Therefore, four levels of local dependence effect are examined: $\sigma^2_{d(i)} = (0.25, 0.5, 0.75, 1)$.

3. The third factor is the ratio of testlet items to independent items in the test. Among the 60 items in the test, the ratio of testlet items to independent items is (1:3, 1:1, 3:1).

4. The fourth factor is the sample size of examinees. Across seventy-four study groups in forty-five different articles from the applied literature, the distribution of sample sizes ranged from 10 to 8912, with two sample sizes greater than 8000 and four smaller than 50. By dividing the remaining sixty-eight sample sizes into three groups according to size ranking and taking the approximate mean value of the sample sizes in each group, we selected three sample sizes for this study: (250, 500, 1000). These quantities represent rounded approximations of the most common sample sizes found in the applied literature.

5. Test length is another issue that must be considered ahead of the research design. The test length in this simulation is set to sixty (60 items per test), the approximate general mean of test length in the reviewed testlet literature.

6. For each condition, the number of replications was chosen based on the most frequent value in the applied literature. Thus, one hundred replications are applied to each condition.
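The full crossing of these factor levels can be enumerated programmatically. The following R sketch (object and column names are illustrative) lays out the 72 conditions:

```r
# Enumerate the 2 x 4 x 3 x 3 = 72 crossed simulation conditions.
# Factor levels are taken from the list above; names are illustrative.
conditions <- expand.grid(
  testlet_size  = c(3, 5),                   # factor 1
  local_effect  = c(0.25, 0.5, 0.75, 1),     # factor 2: sigma^2_d(i)
  testlet_ratio = c("1:3", "1:1", "3:1"),    # factor 3
  sample_size   = c(250, 500, 1000)          # factor 4
)
nrow(conditions)  # 72
```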

3.4 Data Generation

The Rasch testlet model response data are generated using the statistical software R. Response data were generated for 100 samples from a set of population item parameters (60 items) and population ability parameters (1000 trait values $\theta_j$) for each condition. Local effects were assigned to each testlet accordingly. Each simulee was assigned a known trait value $\theta_j$ drawn from the randomly selected population ability parameters. Comparing each simulee's known trait value $\theta_j$ with the co-effect of the local effects within testlets and the randomly selected population item parameters, the probability of observing the response matrix $X = (x_1, \ldots, x_N)$ from a sample of N independently responding examinees can be represented as

$$P(X \mid \theta, b, \gamma) = \prod_{i} \prod_{j} P(x_{ij} \mid \theta_j, b_i, \gamma_{d(i)j}) \qquad (3-4)$$

where $\theta = (\theta_1, \ldots, \theta_N)$, $b = (b_1, \ldots, b_J)$, and the $\gamma_{d(i)j}$ are all considered unknown, fixed parameters. A response matrix of logical indicators was thus generated for each replication within every condition, and a series of random numbers drawn from a uniform distribution on (0, 1) was matched to the response matrix. If the known trait value $\theta_j$ is less than the co-effect of the item and testlet, the logical indicator is false and the simulee's response is set to 0; if the known trait value $\theta_j$ is larger than the co-effect of the item and testlet, the logical indicator is true and the response is set to 1. With the uniform draws entering through this comparison, the scheme is equivalent to assigning a correct response whenever the uniform number falls below the model-implied probability from Equation 2-5. This process repeats for every item and every simulee in each of the 100 samples. Thus, 100 simulated response datasets are generated for each condition.
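A minimal R sketch of this generation step for a single condition, under the model above (truncation of parameters and rounding omitted; all object names and settings are illustrative):

```r
# Minimal sketch of Rasch testlet data generation for one condition.
# Assumes 60 items in testlets of size 3 and a local effect of 0.5;
# all object names are illustrative.
set.seed(2010)
n_persons <- 1000; n_items <- 60; testlet_size <- 3; sigma2 <- 0.5

theta <- rnorm(n_persons)                     # ability, N(0, 1)
b     <- rnorm(n_items)                       # difficulty, N(0, 1)
testlet <- rep(1:(n_items / testlet_size), each = testlet_size)

# person-by-testlet random effects gamma ~ N(0, sigma2)
gamma <- matrix(rnorm(n_persons * max(testlet), sd = sqrt(sigma2)),
                nrow = n_persons)

# model-implied probabilities (Eq. 2-5) and uniform-draw comparison
eta  <- outer(theta, b, "-") + gamma[, testlet]
p    <- plogis(eta)
resp <- (matrix(runif(n_persons * n_items), n_persons) < p) * 1L
```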

3.5 Parameter Estimation

In this study, the parameters of the datasets under the three models (the PCM, the standard Rasch model, and the Rasch testlet model) are estimated using Marginal Maximum Likelihood (MML) methods with ConQuest Version 2.0. The most frequently used approaches to item parameter estimation with unknown trait levels are Joint Maximum Likelihood (JML), Conditional Maximum Likelihood (CML), and Marginal Maximum Likelihood (MML). Holland (1990) compared the different sampling theory foundations of these three ML methods. CML is possible only for the 1PL model and is so computationally intensive as to be impractical in many situations. JML was used extensively in early IRT programs. However, JML estimation also has some drawbacks for estimating IRT models. First, the JML item parameter estimates are biased and inconsistent for fixed-length tests. Second, the JML standard errors are probably too small to handle the unknown person trait levels (Holland, 1990).

The most commonly used method for estimating the parameters of IRT models is Marginal Maximum Likelihood (MML). In MML estimation, unknown trait levels are handled by expressing the response pattern probabilities as expectations over a population distribution. MML has several advantages over the other two ML methods. First, MML is applicable to all types of IRT models. Second, MML is efficient for tests of different lengths. Third, the MML estimates of item standard errors may be justified as good approximations of the expected sampling variance of the estimates. Fourth, estimates are available for perfect scores. In the literature reviewed, the Marginal Maximum Likelihood (MML) method was applied in 82.76% of the articles. Therefore, MML was chosen for parameter estimation in this study.

The simplified mechanism of MML is shown below. The prior knowledge about the examinee distribution, $p(\theta)$, is treated as a prior, and the item difficulty parameter is denoted $\beta$. That is, MML estimates of the item difficulty parameter $\beta$ maximize

$$L(\beta \mid X) = \int p(x \mid \theta, \beta) \, p(\theta) \, d\theta \qquad (3-5)$$

A posterior distribution $p(\beta \mid X)$ is then obtained for the item parameters by multiplying $L(\beta \mid X)$ by $p(\beta)$ (Mislevy, 1986):

$$p(\beta \mid X) \propto L(\beta \mid X) \, p(\beta) \qquad (3-6)$$

3.6 Ability Estimation

In this study, the simulees' performance on a test is scored based on their responses to the items and the IRT models. The estimation of the simulees' abilities is performed with two different approaches: Maximum Likelihood Estimation (MLE; Lord, 1980) and Expected a Posteriori estimation (EAP; Bock & Mislevy, 1982).

Maximum likelihood estimation (MLE) is the most commonly used procedure for examinee ability estimation. Based on the examinee's responses to the test, MLE finds the value of the latent trait that maximizes the likelihood of an item response pattern, under the assumption that the item parameter values are known. The likelihood of the latent trait $\theta$, given an item response pattern $(x_1, x_2, \ldots, x_I)$, is

$$L(x_1, x_2, \ldots, x_I \mid \theta) = \prod_{i=1}^{I} P_i(\theta)^{x_i} \left[1 - P_i(\theta)\right]^{1 - x_i} \qquad (3-7)$$

where $P_i(\theta)$ represents the probability of a given response to item i and I is the number of items in the test.

Although MLE is the most common approach for ability estimation, some of its drawbacks must be addressed. First, MLE is not available for all-endorsed or all-not-endorsed item response patterns; for these two patterns, the MLE diverges to infinity. Second, MLE may not converge when some response patterns are aberrant (Bock & Mislevy, 1982).

Expected a posteriori (EAP) estimation is an efficient approach for examinee trait estimation. EAP is a Bayesian estimator with a non-iterative computation. Unlike MLE, EAP provides a finite estimate for all-endorsed and all-not-endorsed item response patterns. In fact, the EAP estimate is the mean of the posterior distribution. For any test, a set of quadrature nodes $Q_r$ is defined for a fixed number of specified trait values, with a probability density $W(Q_r)$ corresponding to each quadrature node. The EAP trait estimate is derived as

$$\hat{\theta}_{EAP} = \frac{\sum_{r=1}^{N} Q_r \, L(Q_r) \, W(Q_r)}{\sum_{r=1}^{N} L(Q_r) \, W(Q_r)} \qquad (3-8)$$

where $L(Q_r)$ represents the likelihood function (the exponent of the log-likelihood) evaluated at each of the N quadrature nodes. However, some shortcomings of EAP should be mentioned. First, Bayesian estimates tend to regress toward the mean of the prior distribution (Kim & Nicewander, 1993; Weiss, 1982); since ConQuest provides EAP estimates both with and without regression, the EAP estimates without regression were applied. The other shortcoming of EAP is that its estimation accuracy is reduced by an improper prior distribution (Bock & Mislevy, 1982). Since both MLE and EAP ability estimation approaches have their pros and cons, both were applied in this study.
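To make Equation 3-8 concrete, a minimal R sketch of EAP scoring for one simulee under the Rasch model (the quadrature grid, prior, and all object names are illustrative assumptions):

```r
# Minimal sketch of EAP ability estimation (Eq. 3-8) for one simulee
# under the Rasch model; 'resp' is the 0/1 response vector and 'b' the
# known item difficulties. Grid settings and names are illustrative.
eap_theta <- function(resp, b, n_nodes = 61) {
  Q <- seq(-4, 4, length.out = n_nodes)  # quadrature nodes
  W <- dnorm(Q)                          # N(0, 1) prior weights
  # likelihood of the response pattern at each node (Eq. 3-7)
  L <- sapply(Q, function(q) {
    p <- plogis(q - b)
    prod(p^resp * (1 - p)^(1 - resp))
  })
  sum(Q * L * W) / sum(L * W)            # posterior mean (Eq. 3-8)
}

b <- rnorm(20)
resp <- rbinom(20, 1, plogis(0.8 - b))   # simulee with theta = 0.8
eap_theta(resp, b)
```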


Linking across forms in vertical scaling under the common-item nonequvalent groups design University of Iowa Iowa Research Online Theses and Dissertations Spring 2013 Linking across forms in vertical scaling under the common-item nonequvalent groups design Xuan Wang University of Iowa Copyright

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales University of Iowa Iowa Research Online Theses and Dissertations Summer 2013 Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales Anna Marie

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian

More information

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical

More information

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2009 Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices Bradley R. Schlessman

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen

REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT By Qi Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and

More information

Copyright. Kelly Diane Brune

Copyright. Kelly Diane Brune Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari * Electronic Journal of Applied Statistical Analysis EJASA (2012), Electron. J. App. Stat. Anal., Vol. 5, Issue 3, 431 437 e-issn 2070-5948, DOI 10.1285/i20705948v5n3p431 2012 Università del Salento http://siba-ese.unile.it/index.php/ejasa/index

More information

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore

UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT by Debra White Moore B.M.Ed., University of North Carolina, Greensboro, 1989 M.A., University of Pittsburgh,

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan

More information

Scale Building with Confirmatory Factor Analysis

Scale Building with Confirmatory Factor Analysis Scale Building with Confirmatory Factor Analysis Latent Trait Measurement and Structural Equation Models Lecture #7 February 27, 2013 PSYC 948: Lecture #7 Today s Class Scale building with confirmatory

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 1-2016 The Matching Criterion Purification

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times.

The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times. The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times By Suk Keun Im Submitted to the graduate degree program in Department of Educational

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

Item Analysis: Classical and Beyond

Item Analysis: Classical and Beyond Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013 Why is item analysis relevant? Item analysis provides

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

Statistics for Social and Behavioral Sciences

Statistics for Social and Behavioral Sciences Statistics for Social and Behavioral Sciences Advisors: S.E. Fienberg W.J. van der Linden For other titles published in this series, go to http://www.springer.com/series/3463 Jean-Paul Fox Bayesian Item

More information

A Simulation Study on Methods of Correcting for the Effects of Extreme Response Style

A Simulation Study on Methods of Correcting for the Effects of Extreme Response Style Article Erschienen in: Educational and Psychological Measurement ; 76 (2016), 2. - S. 304-324 https://dx.doi.org/10.1177/0013164415591848 A Simulation Study on Methods of Correcting for the Effects of

More information

Computerized Adaptive Testing for Classifying Examinees Into Three Categories

Computerized Adaptive Testing for Classifying Examinees Into Three Categories Measurement and Research Department Reports 96-3 Computerized Adaptive Testing for Classifying Examinees Into Three Categories T.J.H.M. Eggen G.J.J.M. Straetmans Measurement and Research Department Reports

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

You must answer question 1.

You must answer question 1. Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A Thesis Presented to The Academic Faculty by David R. King

More information

The effects of ordinal data on coefficient alpha

The effects of ordinal data on coefficient alpha James Madison University JMU Scholarly Commons Masters Theses The Graduate School Spring 2015 The effects of ordinal data on coefficient alpha Kathryn E. Pinder James Madison University Follow this and

More information

AN ANALYSIS OF THE ITEM CHARACTERISTICS OF THE CONDITIONAL REASONING TEST OF AGGRESSION

AN ANALYSIS OF THE ITEM CHARACTERISTICS OF THE CONDITIONAL REASONING TEST OF AGGRESSION AN ANALYSIS OF THE ITEM CHARACTERISTICS OF THE CONDITIONAL REASONING TEST OF AGGRESSION A Dissertation Presented to The Academic Faculty by Justin A. DeSimone In Partial Fulfillment of the Requirements

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Psychology, 2010, 1: doi: /psych Published Online August 2010 (

Psychology, 2010, 1: doi: /psych Published Online August 2010 ( Psychology, 2010, 1: 194-198 doi:10.4236/psych.2010.13026 Published Online August 2010 (http://www.scirp.org/journal/psych) Using Generalizability Theory to Evaluate the Applicability of a Serial Bayes

More information

Applying the Minimax Principle to Sequential Mastery Testing

Applying the Minimax Principle to Sequential Mastery Testing Developments in Social Science Methodology Anuška Ferligoj and Andrej Mrvar (Editors) Metodološki zvezki, 18, Ljubljana: FDV, 2002 Applying the Minimax Principle to Sequential Mastery Testing Hans J. Vos

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information