Optimizing distribution of Rating Scale Category in Rasch Model


Han-Dau Yau, Graduate Institute of Sports Training Science, National Taiwan Sport University, Taiwan
Wei-Che Yao, Department of Statistics, National Taipei University, Taiwan

Presented at the 76th Annual and the 17th International Meeting of the Psychometric Society, The Hong Kong Institute of Education, Hong Kong, July 19-22, 2011.

Please address correspondence concerning this manuscript to: Han-Dau Yau, Graduate Institute of Sports Training Science, College of Sports and Athletics, National Taiwan Sport University, 250, Wen Hua 1st Rd., Kueishan, Taoyuan County, Taiwan.

Grant funding information: The research project was supported by grants from the National Science Council (NSC H ), Taiwan. The presentation was supported by grants from the National Science Council (NSC I A1), Taiwan.

Abstract

Traditionally, the rating categories of sports skill tests have been set subjectively. Yau (2010) developed an objective method of category setting, but did not identify which theoretical distribution the categories should follow. The purpose of this study was to find the optimal distribution of rating scale categories in the Rasch model. The distributions studied were the normal, logistic, binomial, and uniform distributions. Random data were generated with SAS and Minitab, and the categories were then estimated with Winsteps. The experiment used a two-way design (sample size × test length), with five simulations per cell. The results were: 1. The normal distribution was the ideal distribution when the number of responses exceeded 3000. 2. The logistic distribution was the better distribution when the number of responses was below 1500. The study concluded that the optimal distribution of rating scale categories in the Rasch model depends on sample size.

Key words: Rasch measurement, normal distribution, logistic distribution, binomial distribution, uniform distribution.

Introduction

Rasch measurement has been recommended for constructing sport skills tests. The original tests were designed from subjective experience, and we found that the category settings of the scale did not capture enough response information from subjects. This showed up as a problem of disordered categories. Yau (2010) therefore developed an objective and reasonable approach to category setting, but did not clarify which theoretical distribution it relied on. The present work is a simulation study of category setting in sport skills tests that looks for the theoretical distribution best suited to the categories of the Rasch rating scale model. It extends the approach to category setting in sport skill testing that was developed for Multiple-Attempt Single-Item tests (Yau, 2010).

In test development practice, item difficulty often does not match subject ability. In particular, tests constructed with the Rasch rating scale model frequently show disordered categories. The remedy should be to revise the scoring standards, but how the range of each category should be divided still needs a reasonable basis. This study therefore attempts to find the best category distribution for the Rasch rating scale on the basis of probability distribution theory.

The study of category disorder originated from the results of Rasch model estimation. The published literature began with Linacre (1991), "Step Disordering and Rasch-Thurstone Thresholds," which concluded: "Rasch-Thurstone-type Thresholds provide the best estimate of the transitions between categories of the rating scale. Step disorder is of no concern, provided that the structure of the scale is conceptually sound." That article traced the problem of disordered rating scale categories back to the theory of the Rasch-Thurstone-type threshold.
When a person's estimated ability reaches a higher level, that person should have crossed the higher threshold, and a person who has reached a higher threshold has necessarily passed the lower ones. Overall, that paper did not analyze category thresholds from actual response estimates; it described them only theoretically. The rating scale model was extended from the dichotomous Rasch model, which has no disorder problem because it involves only the level of item difficulty. Earlier research on disordered categories is reviewed below.

1. Reasons for step difficulty disorder. Shaw, Wright and Linacre (1992) pointed out: "Each category represents greater success than the previous category. Observing category 2, however, is unlikely when compared to categories 1 or 3. Consequently step difficulties will be disordered. The step from 2 to 3 will be easier than from 1 to 2. This has nothing to do with the difficulty of the tasks. It is determined entirely by our peculiar specification of the rating scale, a specification that David would criticize." By definition the categories are arranged in order, but the categories estimated from the actual response frequencies may be disordered. This shows that when the frequency of a category is too small its probability is low, so that it is never the most probable category at any ability level. In sport skills tests, the problem is an inappropriate setting of category ranges, so that item difficulty and subject ability do not match. Which theoretical distribution the adjustment of category ranges should be based on needs to be clarified by further research.

2. Reliability and validity of Rasch measurement. First, Stone and Wright (1994) found that person separation was best with the categories grouped into 3 levels from the original five-category scale. Second, Bond and Fox (2007), in their "guidelines for collapsing categories", proposed: "The rating scale diagnostics help us in determining the best categorization, knowledge of Rasch reliability and validity indices tell us how the measure is functioning as a whole" (p. 231). In other words, Bond and Fox held that the best categorization improves test reliability and validity. In recent years, some studies of disordered categories have confirmed the effect on reliability and validity. For example, Tennant (2004) reported threshold disorder in health care that affected the validity of the Functional Independence Measure.

3. Solutions to disordered categories. Andrich (1996) suggested that categories 2 and 3 should be combined, and categories 4 and 5 likewise. This may resolve the disorder, but it gives up a category. Yau, Chi, Zhou and Yao (2008) revised the standing long jump observation checklist, using collapsed categories to treat the disordered categories. Although the revised checklist fit the theory, it reduced the number of categories and, with it, the discrimination and validity of the checklist.
In addition, Pesudovs, Gothwal, Wright, and Lamoureux (2010) ran into a problem when treating disordered categories by combination in the National Eye Institute Visual Function Questionnaire: "Because category 3 is a neutral category, it did not seem logical to combine it with an adjacent category." Another solution was offered by Siegert, Jackson, Tennant, and Turner-Stokes (2010), who proposed five steps of Rasch analysis; the main method was stepwise deletion of the worst-fitting item. A further, newer procedure for dealing with disordered categories comes from Van Lente, Karabatsos, and Uekawa (2010). They reviewed the traditional procedure, which searches for the particular recategorization of the rating scale that eliminates inconsistencies, and noted that there is no clear consensus as to which method is best. They then created a new approach: "This study introduces a sample-free method of rating scale optimization, based on the bootstrap, which addresses the issues just mentioned. The bootstrap is a general statistical procedure that simulates the population distribution by resampling from the original (sample) data set with replacement." Finally, Linacre (2010) advocated: "This paper suggests that transition categories should no longer be viewed as threats to valid measurement, but rather as an integral and increasingly important part of the advance of social science. Consequently, social-science measurement must improve the analysis and communication of transitional categories rather than attempt to eliminate them."

Based on the above, threshold disorder is caused by categories that are not clearly defined, so that subjects are unable to distinguish them. The solutions were to combine categories or to delete the items with disordered thresholds. Combined categories must also make logical sense as responses. Both methods reduce categories or delete items, so they are temporary measures rather than a fundamental way to improve disordered categories. Finally, Linacre (2010) pointed toward better measurement analysis and clearer communication of category meaning as a good strategy for future work on threshold disorder. These ideas are consistent with this study's development of category setting methods in sport skill testing.

4. Optimizing rating scale categories. Linacre (2001) pointed out: "When the thresholds are Rasch-Andrich thresholds or step calibrations, then disordering occurs when some categories never become modal, i.e., they are not observed frequently enough. This implies that they correspond to intervals on the latent variable narrower than about 1 logit in terms of Rasch-Thurstone thresholds." That study indicated that category disorder is caused by too few observations, and that disorder appears when the threshold interval is narrower than about 1.0 logit. Its contribution was to present objective criteria for judging poor categories. Linacre (2002) then studied the appropriate range of categories and set out a number of guidelines:

Preliminary Guideline: All items oriented with latent variable.
Guideline 1: At least 10 observations of each category.
Guideline 2: Regular observation distribution.
Guideline 3: Average measures advance monotonically with category.
Guideline 4: OUTFIT mean-squares less than 2.0.
Guideline 5: Step calibrations advance.
Guideline 6: Ratings imply measures, and measures imply ratings.
Guideline 7: Step difficulties advance by at least 1.4 logits.
Guideline 8: Step difficulties advance by less than 5.0 logits.

These guidelines for appropriate categories have worked well in practice. When a step difficulty advanced by less than 1.4 logits, the category became disordered; an example is the study of the standing long jump observation checklist by Yau et al. (2008).

5. Disordered categories in sports psychology surveys. For a self-efficacy scale, Zhu, Updyke and Lewandowski (1997) used Rasch analysis to study the optimal categorization and reported two results: (1) "It was found that, instead of the five-category construct designed, the best order of category meanings of the scale in respondents' perceptions was a three-category construct." (2) Post-hoc Rasch analysis was useful in determining the optimal categorization of an ordered-response scale. That study also used the method of combining categories: the categories were reduced from five to three, and category fit improved. Zhu and Kang (1998) found that, "instead of the 5-category construct designed, the optimal order of category meanings of the scale in respondents' perceptions was a 3-category construct", and that "the Rasch threshold estimates and separation statistics continuously played critical roles in determining the optimal categorization." Zhu, Timm and Ainsworth (2001) studied Rasch calibration and optimal categorization of an instrument measuring women's exercise perseverance and barriers. They found: "Instead of the original five-category construct, which had a disordered internal construct, a collapsed three-category construct (i.e., Very Often/Often, Sometimes/Rarely, and Never) was found to have the optimal categorization." Zhu (2002) reported that the scale was modified from its original five-category structure ("Very often" = 1, "Often" = 2, "Sometimes" = 3, "Rarely" = 4, and "Never" = 5) to a three-category structure ("Very Often" = 1, "Sometimes" = 2, and "Never" = 3). These results verified that the characteristics of the optimal categorization identified by the Rasch post-hoc analysis can be maintained after the original scale is modified on the basis of such an analysis. Myers, Feltz, and Wolfe (2008) carried out a confirmatory study of rating scale category effectiveness for the coaching efficacy scale. They discussed the structure of optimal categorization and described two approaches: 1. An exploratory study identifies an effective categorization structure through exploratory post hoc methods. 2. A confirmatory study confirms the effectiveness of this categorization structure in a subsequent study. These studies concern psychological perceptions, where category setting is a matter of semantics. This is unrelated to dividing specific performance ranges in sport skill testing, so what counts as a reasonable category setting differs between the two.

6. Step disorder in sports testing. Chou and Yau (2006) evaluated the Standing Long Jump Observation Checklist (SLJOC) for assessing developmental level. The average measure of category 2 of item 3 was out of order: "This disorder in the average measures of categories might imply the disorder in the category definitions." Chou and Yau focused on evaluation, so they only reported the disorder and attributed it to the category definitions without trying to solve the problem. Yau et al. (2008) revised the SLJOC. To deal with the disorder found in estimation, they combined the second and third categories in the second item, and the first and second categories in both the third and the sixth items. The revised estimation gave a better data-model fit. The solution in Yau et al. was the category-combination tactic of Bond and Fox (2007). From another point of view, this tactic is a poor way to solve the problem: if categories are combined repeatedly, discrimination falls, which defeats the purpose of category setting. In addition, Chen (2008), in constructing a forehand drive test for first-level table tennis athletes, found a workable category setting by testing and revising three times. Lacking support from theory and concrete data, that study could only show from the software estimates that no disorder problem remained, so that the Rasch model could be used to analyze the data. Clearly we need a sound method of category setting that solves the problem thoroughly. Yau (2010) then developed a reasonable category setting approach for sport skill testing, but could not establish which theoretical distribution was better to use. A proper category distribution theory is the key to successful measurement, so establishing the theoretical distribution of rating scale categories in the Rasch model is the remedy for the disorder problem.

The difficulties found in the studies reviewed above show a clear need for a category setting method for sports skill testing. When the Rasch model is used to construct a test, there is no practical method for setting the distribution ratio of the rating scale categories. If estimation produces category disorder, only makeshift methods are available. The fundamental and better way is to examine the category distribution theory of test construction, so that a reasonable theoretical distribution can be established. The purpose of this study is therefore to find, by simulation, the best theoretical distribution of rating scale categories in the Rasch model.

Methods

Objects of study

The normal, binomial, and logistic distributions are the theoretical distributions related to rating scale categories in the Rasch model. The objects of this study were these three distributions together with the uniform distribution, which Linacre (2002) discussed in connection with proper categories. First, the logistic distribution: Rasch (1960) followed students' reading progress over time and sought a common measure of reading improvement that could estimate reading ability accurately. That work began with a Poisson model for the number of reading errors and defined an ability and a difficulty (a person p of ability B_p, a text t of difficulty D_t); taking natural logarithms leads to a cumulative function that is the logistic distribution. Second, the binomial distribution: the dichotomous model follows the Bernoulli distribution, and the rating scale consists of ordinal categories that follow the binomial distribution, so the binomial distribution was included in the study. Third, the normal distribution: practical data are often normally distributed because population characteristics are typically normally distributed. The normal distribution also approximates both the binomial and the Poisson distributions (Ho & Yau, 1997).
Moreover, most statistical problems are usually solved with the normal distribution, so it was included in this study. Finally, regarding the proper category range in Linacre (2002), Guideline 2 calls for a regular observation distribution: "Irregularity in observation frequency across categories may signal aberrant category usage. A uniform distribution of observations across categories is optimal for step calibration." The uniform distribution was therefore also considered in this study.

Procedures

This study examined the theoretical distribution of categories by simulation. Statistical software was used to simulate data, and the categories were then estimated and verified under each theoretical distribution. The following sections describe how data were simulated for the normal, logistic, binomial, and uniform distributions, and how the quality of category estimation was evaluated under each theoretical distribution.
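As a rough illustration of the simulation idea, the sketch below draws six-category scores (0 to 5) from each of the four distributions using only the Python standard library. It is not the SAS or Minitab procedure used in the study. The function names, the equal-width cut points, the ranges, and the binomial parameters are all assumptions made for the example, because the paper does not state how continuous draws were turned into categories.

```python
import math
import random


def to_category(x, low, high):
    """Cut a continuous draw into six equal-width bins over [low, high].

    The equal-width rule and the range are assumptions for illustration.
    """
    width = (high - low) / 6
    k = int((x - low) // width)
    return min(5, max(0, k))  # draws outside the range go to the end bins


def draw(dist, rng):
    """Return one category score (0-5) from the named distribution."""
    if dist == "normal":
        return to_category(rng.gauss(0.0, 1.0), -3.0, 3.0)
    if dist == "logistic":
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        x = math.log(u / (1.0 - u))  # inverse CDF of the standard logistic
        return to_category(x, -5.4, 5.4)  # about 3 SD, since SD = pi/sqrt(3)
    if dist == "binomial":
        # Binomial(5, 0.5) as a sum of five Bernoulli trials (assumed parameters)
        return sum(rng.random() < 0.5 for _ in range(5))
    if dist == "uniform":
        return rng.randrange(6)
    raise ValueError(f"unknown distribution: {dist}")


def simulate(dist, n_persons, n_items, seed=0):
    """Simulate an n_persons x n_items matrix of category scores."""
    rng = random.Random(seed)
    return [[draw(dist, rng) for _ in range(n_items)] for _ in range(n_persons)]


if __name__ == "__main__":
    data = simulate("normal", n_persons=30, n_items=10, seed=1)
    print(len(data), "persons x", len(data[0]), "items")
```

A matrix produced this way would still need to be exported and calibrated with Rasch software; the sketch covers only the data-generation step.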

Data

We used the normal, binomial, and ranuni functions in SAS to produce random data following the normal, binomial, and uniform distributions, and saved them as SAS permanent files (file extension sas7bdat). We then used the logistic function in Minitab 14 to produce random data following the logistic distribution, saved them as a Minitab 14 file, and transferred that file directly to Excel.

Rasch Analysis

Winsteps read the SAS permanent files and the Excel file to build the Winsteps control and data files. Winsteps then estimated the parameters and evaluated the data-model fit of the simulated data.

Design

The simulation study had four factors: distribution theory (DT), sampling frequency (SF), sample size (SP), and test length (TL). Distribution theory was determined by the Rasch measurement model. The levels of the other three factors were best chosen with generalizability theory, so they had to be settled first. In the pilot study we defined five random samplings (S), each drawing 30 persons (P), with a test length of ten items (I). The generalizability model was (P:S) × I, meaning that P was nested within S and (P:S) was crossed with I. Table 1 shows the results of the G-study.

Table 1 Results of Generalizability Study

                       Percent of Variance Components
Source of Variation    Binomial   Logistic   Normal    Uniform   Mean
S                      0.00%      0.00%      0.00%     0.00%     0.00%
P:S                    0.75%      2.99%      0.00%     0.00%     0.94%
I                      0.85%      0.00%      0.05%     0.02%     0.23%
SI                     0.55%      1.02%      0.00%     0.00%     0.39%
PI:S                   97.84%     95.99%     99.95%    99.98%    98.44%
Total                  %          %          %         %         %

Notes: P: sample size. S: sampling frequency. I: test length.

The generalizability study gave the percentage of the variance components for each source of variation (SV). The mean for random error (PI:S) was 98.44%, by far the largest share, while every other SV accounted for less than 1%.
This showed that the simulated data were consistent with random sampling theory. The mean SV of sample size nested within sampling frequency (P:S) was 0.94%, the second highest value; the mean SV of sampling frequency crossed with test length (SI) was 0.39%, the third highest; and the mean SV of test length (I) was 0.23%, the fourth highest. Finally, the mean SV of sampling frequency (S) was approximately zero and could be disregarded.
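The Mean column of Table 1 and the ranking of the sources of variation described above can be reproduced with a few lines of arithmetic. The sketch below simply averages the four reported percentages per source; the dictionary and function names are introduced here for illustration only.

```python
# Percent of variance components from Table 1, in the order
# Binomial, Logistic, Normal, Uniform.
TABLE1 = {
    "S": [0.00, 0.00, 0.00, 0.00],
    "P:S": [0.75, 2.99, 0.00, 0.00],
    "I": [0.85, 0.00, 0.05, 0.02],
    "SI": [0.55, 1.02, 0.00, 0.00],
    "PI:S": [97.84, 95.99, 99.95, 99.98],
}

# Mean column as reported in Table 1.
REPORTED_MEAN = {"S": 0.00, "P:S": 0.94, "I": 0.23, "SI": 0.39, "PI:S": 98.44}


def row_means(table):
    """Average each source of variation across the four distributions."""
    return {source: sum(vals) / len(vals) for source, vals in table.items()}


def ranked_sources(means):
    """Sources of variation ordered from largest to smallest mean share."""
    return sorted(means, key=means.get, reverse=True)


if __name__ == "__main__":
    means = row_means(TABLE1)
    for source in ranked_sources(means):
        print(f"{source:5s} computed {means[source]:.3f}%  "
              f"reported {REPORTED_MEAN[source]:.2f}%")
```

The computed means agree with the reported column to within rounding, and the ordering (PI:S, P:S, SI, I, S) matches the ranking given in the text.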

Based on the results of the generalizability study, sample size had the second largest average share of variance, so for each distribution we set a large sample (N=1000), a medium sample (N=300), and a small sample (N=30). Test length had the fourth largest average share, so we defined a short form (I=10) and a long form (I=50). The average share of sampling frequency approached zero, so we sampled randomly five times. Across the whole study, each combination (4 distributions × 3 sample sizes × 2 test lengths) was sampled five times, which produced 120 simulated data sets. The simulated data had six categories, labeled 0 to 5. We compared the accuracy of parameter estimation, how often and in which category threshold disorder occurred, and which statistical distribution the data followed. In addition to the criteria suggested by Linacre (2002), this study proposed its own judgment indexes.

Analysis

This study used Rasch analysis in Winsteps to evaluate the fit of the random data produced under the different distributions and the quality of the categories. The estimation gave the measures for items, subjects, and categories, together with the estimated standard errors and the INFIT and OUTFIT statistics, which describe the information of the test (reliability). The data-model fit for items and subjects supports the validity of the test. Most important was the evaluation of the random data produced by each theoretical distribution, using the category frequencies, percentages, steps, and thresholds. We used the criteria for reasonable categories (Linacre, 2002) to evaluate the Rasch rating scale model.

Results

The results first address data-model fit and estimation accuracy.
Then we analyzed the number of ideal step difficulties, the category observation counts, and the disordering of step calibrations in the category estimation. Finally, these results were brought together in a comprehensive discussion.

Category Fit Statistics

First we analyzed data-model fit with Rasch measurement. When the data fit the Rasch model, the measurement is valid and further inference is meaningful. Linacre (2002) Guideline 4, "OUTFIT mean-squares less than 2.0", gives the minimum standard for category fit. The category fit of the simulated data from the four distributions was estimated with the Rasch rating scale model, with the following results (Table 2): (1) The INFIT mean square was 0.997 for the normal, 0.997 for the logistic, 0.999 for the binomial, and 0.997 for the uniform distribution. (2) The OUTFIT mean square was 0.997 for the normal, 0.997 for the logistic, 0.999 for the binomial, and 0.996 for the uniform distribution. (3) ANOVA comparing category fit across the four distributions showed that neither the F value for the INFIT mean square nor that for the OUTFIT mean square reached significance. The means of INFIT and OUTFIT were therefore close to 1 and satisfied Guideline 4, OUTFIT mean-squares less than 2.0 (Linacre, 2002). The categories estimated under all four distributions thus fit the Rasch model; category estimation fit well and the measurement supports valid inference.

Table 2 Fit Statistics of category in simulation data

Fit Statistics   Statistics   Normal   Logistic   Binomial   Uniform   F Value
INFIT MNSQ       Mean         0.997    0.997      0.999      0.997
                 SD
OUTFIT MNSQ      Mean         0.997    0.997      0.999      0.996
                 SD
* p <.05

Estimated standard error of categories

The standard error of category estimation indicates the accuracy of measurement estimation. The results under the Rasch rating scale model were as follows (Table 3): (1) In the three-way ANOVA, the interactions DT × SP × TL and DT × SP and the main effects DT, SP, and TL were significant (p <.05).
(2) The ω² was 0.069 for distribution theory, 0.060 for sample size, 0.044 for test length, and 0.004 for DT × SP. Keppel (1991) noted that in the behavioral sciences an ω² of 0.15 is large and 0.06 is medium, with a still lower benchmark for a small effect. Distribution theory and sample size therefore had medium effect sizes and test length a small effect size. DT × SP and DT × SP × TL both had small effect sizes. (3) Post hoc comparison with the Scheffé test for distribution theory, a medium-sized effect, showed: (A) The standard error of category estimation under Normal was larger than under Binomial, Uniform, and Logistic. (B) The standard error of category estimation under Binomial was larger than under Logistic. The standard error was thus clearly largest under Normal and smallest under Logistic, meaning that category estimation was least accurate under Normal and most accurate under Logistic. (4) Post hoc comparison with the Scheffé test for sample size, a medium-sized effect, showed: (A) The standard error of category estimation with the small sample was larger than with the medium sample. (B) The standard error with the medium sample was larger than with the large sample. The standard error of category estimation was therefore affected by sample size: the larger the sample, the smaller the standard error. (5) Post hoc comparison with the Scheffé test for test length, a small effect, showed that the standard error of category estimation for the short form was larger than for the long form. The standard error was therefore affected by test length: as test length increased, the standard error became smaller. (6) The effect sizes of DT × SP and DT × SP × TL were tiny, so further discussion is not meaningful even though the tests were significant.

Table 3 Summary of 3W ANOVA for estimated standard error in simulation data

Source of Variation       DF   Type III SS   MS   F Value   ω²
a (DT)                                            *         0.069
b (SP)                                            *         0.060
a*b (DT × SP)                                     *         0.004
c (TL)                                            *         0.044
a*c (DT × TL)
b*c (SP × TL)
a*b*c (DT × SP × TL)                              *
Error
Total
Notes: * p <.05. DT = distribution theory, SP = sample size, TL = test length.

Step difficulties

The criteria for proper categories in Linacre (2002) include Guideline 7, "Step difficulties advance by at least 1.4 logits", and Guideline 8, "Step difficulties advance by less than 5.0 logits". Following these criteria, we tested whether the step difficulty of the data simulated from the four distributions lay in the range from 1.4 to 5.0 logits, using a one-sample t test with μ=1.4 (Table 4).

Table 4 Summary of one sample t-test (μ=1.4) for estimated step difficulty

Distribution   N   Mean   SD   S Err   t Value
Normal                                 *
Logistic           1.06                *
Binomial           0.87                *
Uniform            0.03                *
* p <.05

The one-sample t test of step difficulty, with the null hypothesis μ=1.4, was significant for every distribution (p <.05). The mean for Normal was clearly larger than 1.4, whereas the means for Logistic, Binomial, and Uniform were 1.06, 0.87, and 0.03, all smaller than 1.4. Only the normal distribution met both criteria. We then checked whether each observed value met the two criteria. Table 5 gives the frequency and percentage of optimal step difficulties in the simulated data.

Table 5 The frequency and percentage of optimizing step difficulty in simulation data

sample size       test length    Normal     Logistic   Binomial   Uniform   Response
large (N=1000)    long (I=50)    20(100%)   5(25%)     0(0%)      0(0%)
                  short (I=10)   20(100%)   6(30%)     0(0%)      0(0%)
medium (N=300)    long (I=50)    20(100%)   5(25%)     0(0%)      0(0%)
                  short (I=10)   19(95%)    9(45%)     0(0%)      0(0%)     3000
small (N=30)      long (I=50)    16(80%)    9(45%)     0(0%)      0(0%)     1500
                  short (I=10)   10(50%)    10(50%)    4(20%)     0(0%)     300
Total                            105(88%)   44(37%)    4(3%)      0(0%)

Overall, the step difficulties of the data simulated from the normal distribution fit the model best. In decreasing order, Normal, Logistic, Binomial, and Uniform had 105, 44, 4, and 0 optimal step difficulties, or 88%, 37%, 3%, and 0%. From these results we found: (1) The normal distribution gave the best step difficulties with large or medium samples, but became worse with small samples. (2) The logistic distribution behaved differently from the normal distribution: its share of optimal step difficulties increased as the sample became smaller. (3) The binomial distribution reached 20% optimal step difficulties only with the small sample and the short form. (4) The uniform distribution produced no optimal step difficulty, so this distribution did not fit the Rasch measurement model.

Category Frequencies

The simulated data were produced with randomly set parameters, and the number of observations in each category was likewise random.
Guideline 1 of the proper category guidelines in Linacre (2002) requires at least ten observations in each category as a lower limit. A category with no observations at all is called an empty category; it can affect test validity and reliability, and its effect is roughly that of collapsing a disordered category. First, for each distribution we counted the categories with fewer than 10 observations; the results are shown in Table 6.

Table 6 Summary of the number of categories for less than 10

sample size       test length    Normal    Logistic   Binomial   Uniform   Response
large (N=1000)    long (I=50)    0(0%)     0(0%)      0(0%)      0(0%)
                  short (I=10)   10(33%)   0(0%)      0(0%)      0(0%)
medium (N=300)    long (I=50)    8(27%)    0(0%)      0(0%)      0(0%)
                  short (I=10)   10(33%)   0(0%)      0(0%)      0(0%)     3000
small (N=30)      long (I=50)    10(33%)   1(3%)      0(0%)      0(0%)     1500
                  short (I=10)   12(40%)   10(33%)    6(20%)     0(0%)     300
Total                            50(28%)   11(6%)     6(3%)      0(0%)

By distribution: (1) For Normal, 50 categories (28%) had fewer than 10 observations; only the large sample with the long form satisfied Linacre's Guideline 1. (2) For Logistic, 11 categories (6%) had fewer than 10 observations, all with the small sample. (3) For Binomial, 6 categories (3%) had fewer than 10 observations, all with the small sample and the short form. (4) For Uniform, no category had fewer than 10 observations. For Normal, therefore, more than one quarter of the categories failed Guideline 1, which was the worst result. Logistic and Binomial failed Guideline 1 only with the small sample and so did better than Normal. Only Uniform satisfied Guideline 1 completely.
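The counting behind Table 6, and the check for empty categories that follows, can be expressed in a few lines. The sketch below is an illustration only: the function names and the example response vector are invented for the example, and the threshold of 10 is the lower limit from Linacre's Guideline 1 as quoted in the text.

```python
from collections import Counter


def category_counts(responses, n_categories=6):
    """Count observations in each category 0..n_categories-1."""
    counts = Counter(responses)
    return [counts.get(k, 0) for k in range(n_categories)]


def sparse_categories(counts, minimum=10):
    """Categories with fewer than `minimum` observations (Guideline 1)."""
    return [k for k, c in enumerate(counts) if c < minimum]


def missing_categories(counts):
    """Categories that were never observed (empty categories)."""
    return [k for k, c in enumerate(counts) if c == 0]


if __name__ == "__main__":
    # Hypothetical pooled responses: 300 scores with thin tails.
    responses = [0] * 2 + [1] * 40 + [2] * 120 + [3] * 110 + [4] * 28
    counts = category_counts(responses)
    print("counts :", counts)
    print("sparse :", sparse_categories(counts))
    print("missing:", missing_categories(counts))
```

In this made-up example the two extreme categories fall below the limit and the top category is empty, which is the pattern the text reports for bell-shaped data at small sample sizes.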
Table 7 Summary of the frequency and percentage for missing category

sample size      test length    Normal    Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    0(0%)     0(0%)      0(0%)      0(0%)
                 short (I=10)   0(0%)     0(0%)      0(0%)      0(0%)
medium (N=300)   long (I=50)    0(0%)     0(0%)      0(0%)      0(0%)
                 short (I=10)   0(0%)     0(0%)      0(0%)      0(0%)     3000
small (N=30)     long (I=50)    3(10%)    0(0%)      0(0%)      0(0%)     1500
                 short (I=10)   10(33%)   0(0%)      0(0%)      0(0%)     300
Total                           13(7%)    0(0%)      0(0%)      0(0%)

Next, we analyzed the number of missing estimated categories for each distribution, with the following results (Table 7):
(1) For Normal, 3 categories (10%) were missing with the small sample size and the long form, and 10 categories (33%) were missing with the small sample size and the short form.
(2) For Logistic, Binomial, and Uniform, no category was missing.
Therefore, in terms of sample size, when the number of responses was 1500 or fewer, the normal distribution had missing categories and thus fewer categories, which was a defect; the other distributions had no missing categories.

The frequency of steps calibration disorder

Guideline 3 of Linacre (2002) states that average measures advance monotonically with category, and Guideline 5 states that step calibrations advance. Based on these guidelines, we judged the step calibration disorder of the simulated data for the four distributions. The results are in Table 8:
(1) Step calibration disorder did not occur for the normal distribution; the logistic and binomial distributions each had only 1 disordered step calibration, and the uniform distribution had 47.
(2) For the logistic and binomial distributions, step calibration disorder occurred with the small sample size (fewer than 3000 responses).
(3) For the uniform distribution, step calibration disorder occurred under every combination of sample size and test length, with disorder percentages ranging from 30% to 45%.
Therefore, with respect to step calibration disorder, the normal distribution fit the Rasch model. For the logistic and binomial distributions, step calibration disorder occurred only with the small sample size, whereas the uniform distribution did not fit the Rasch model.

Table 8 Summary of the frequency and percentage for steps calibration disorder

sample size      test length    Normal   Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    0(0%)    0(0%)      0(0%)      6(30%)
                 short (I=10)   0(0%)    0(0%)      0(0%)      8(40%)
medium (N=300)   long (I=50)    0(0%)    0(0%)      0(0%)      8(40%)
                 short (I=10)   0(0%)    0(0%)      0(0%)      8(40%)    3000
small (N=30)     long (I=50)    0(0%)    1(5%)      0(0%)      9(45%)    1500
                 short (I=10)   0(0%)    0(0%)      1(5%)      8(40%)    300
Total                           0(0%)    1(1%)      1(1%)      47(39%)

Discussion

This study simulated data from the normal, logistic, binomial, and uniform distributions and evaluated the best theoretical distribution for the categories of the Rasch rating scale model, judged by the category criteria of Linacre (2002). First, regarding the category fit statistics: the categories fit the Rasch model for all four distributions, and the category estimates fit well, which supports inferences of measurement validity.
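The step calibration checks used in the Results (ordering under Guideline 5, and the 1.4-logit advance criterion from the step difficulty evaluation) can be sketched as follows. This is a minimal illustration, not the procedure Winsteps applies: the function name and the example calibrations are hypothetical, and the upper bound of 5.0 logits is an assumption taken from the guidelines in Linacre (2002), since the second criterion is not named in this section.

```python
def check_step_calibrations(steps, min_advance=1.4, max_advance=5.0):
    """Check a list of step calibrations (in logits, in category order).

    Returns the 0-based positions of advances (between step i and i+1)
    that are disordered, too narrow, or too wide.
    max_advance=5.0 is an assumption based on Linacre (2002).
    """
    advances = [b - a for a, b in zip(steps, steps[1:])]
    disordered = [i for i, d in enumerate(advances) if d <= 0]
    too_narrow = [i for i, d in enumerate(advances) if 0 < d < min_advance]
    too_wide = [i for i, d in enumerate(advances) if d >= max_advance]
    return {
        "ordered": not disordered,      # Guideline 5: step calibrations advance
        "disordered": disordered,
        "too_narrow": too_narrow,       # advance below the 1.4-logit criterion
        "too_wide": too_wide,
    }


# Hypothetical step calibrations (illustrative values only).
ordered_case = check_step_calibrations([-2.1, -0.4, 0.3, 2.0])
disordered_case = check_step_calibrations([-1.0, 0.5, 0.2, 1.5])
print(ordered_case)
print(disordered_case)
```

In the first hypothetical case the calibrations advance but one advance is narrower than 1.4 logits; in the second, one calibration falls below its predecessor, which is the kind of disorder counted in Table 8.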
Second, we compared the standard error of category estimation for each distribution, and the results showed:
(1) The accuracy of category estimation was worse under the normal distribution and better under the logistic distribution.
(2) Sample size affected the standard error of category estimation: the larger the sample size, the smaller the standard error.
(3) Test length also affected the standard error of category estimation: the longer the test, the smaller the standard error.

The third comparison concerned the step difficulty of the estimates from the simulated data for each distribution. The normal distribution showed the best step difficulty with large and medium sample sizes, and its proportion of optimal step difficulties decreased with the small sample size; the logistic distribution showed the opposite pattern. Neither the binomial nor the uniform distribution fit the Rasch measurement model.

The fourth analysis concerned missing categories. Missing categories occurred for the normal distribution when the number of responses was 1500, which reduced the number of categories and was an obvious defect. There were no missing categories for the other three distributions.

The fifth analysis concerned step calibration disorder. The results showed that the normal distribution fit the Rasch model, that step calibration disorder occurred for both the logistic and binomial distributions with the small sample size, and that the uniform distribution did not fit the Rasch model.

The results were:
(1) The normal distribution was the ideal distribution when the number of responses was over 3000.
(2) The logistic distribution was the better distribution when the number of responses was less than 1500.
The conclusion of the study was that the optimal distribution of rating scale categories in the Rasch model is related to sample size.

This study (NSC H ) was sponsored by the National Science Council, R.O.C. (Taiwan).

References

Andrich, D. (1996). Category Ordering and their Utility. Rasch Measurement Transactions, 9(4),
Bond, T. G., & Fox, C. M. (2007). Applying The Rasch Model: Fundamental Measurement in the Human Sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Chen, W. C. (2008). A construction of forehand driving test for first level athletes in Table Tennis. Unpublished doctoral dissertation, National Taiwan Sport University (Taoyuan), Graduate Institute of Sports Training Science, Taoyuan.
Chou, S. I., & Yau, H. D. (2006). An Evaluation of the Assessing the Development Level of the Standing Long Jump Observation Checklist. Paper presented at the 2nd Pacific Rim Objective Measurement Symposium, PROMS HK 2006, Hong Kong Institute of Education, Hong Kong, June.
Ho, R. G., & Yau, H. D. (1997). The relationship between Binomial, Normal and Poisson distribution in the psychomotor test. Psychological Testing, 44(1),
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Linacre, J. M. (1991). Step Disordering and Rasch Thurstone Thresholds. Rasch Measurement Transactions, 5(3), 171.
Linacre, J. M. (2001). Category, Step and Threshold: Definitions & Disordering. Rasch Measurement Transactions, 15(1), 794.
Linacre, J. M. (2002). Understanding Rasch Measurement: Optimizing Rating Scale Category Effectiveness. The Journal of Applied Measurement, 3(1),
Linacre, J. M. (2010). Transitional Categories and Usefully Disordered Thresholds. Online Educational Research Journal, 1. Retrieved March 28, 2011, from
Myers, N. D., Feltz, D. L., & Wolfe, E. W. (2008). A Confirmatory Study of Rating Scale Category Effectiveness for the Coaching Efficacy Scale. Research Quarterly for Exercise and Sport, 79(3),

Pesudovs, K., Gothwal, V. K., Wright, T., & Lamoureux, E. L. (2010). Remediating serious flaws in the National Eye Institute Visual Function Questionnaire. Journal of Cataract & Refractive Surgery, 36,
Rasch, G. (1960/80). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. Expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
Siegert, R. J., Jackson, D. M., Tennant, A., & Turner-Stokes, L. (2010). Factor analysis and Rasch analysis of the Zarit Burden Interview for acquired brain injury carer research. Journal of Rehabilitation Medicine, 42,
Shaw, F., Wright, B., & Linacre, J. M. (1992). Disordered Steps? Rasch Measurement Transactions, 6(2), 225.
Stone, M. H., & Wright, B. D. (1994). Maximizing rating scale information. Rasch Measurement Transactions, 8, 386.
Tennant, A. (2004). Disordered Thresholds: An example from the Functional Independence Measure. Rasch Measurement Transactions, 17(4),
Van Lente, E., Karabatsos, G., & Uekawa, K. (2010). A bootstrap approach to rating scale optimization. In W. P. F. Fisher & M. Wilson (Eds.), Access to the Foundations of Measurement: Professional Identity in the Career of Benjamin D. Wright (forthcoming). Retrieved April 1, 2011, from
Yau, H. D. (2010). A study of reasonable categories setting approach in sport skill testing: An example of Multiple-Attempt Single-Item tests. Research Project Report of National Science Council (NSC H ).
Yau, H. D., Chi, S. C., Chou, S. I., & Yao, W. C. (2008). A revision of the assessing developmental level of the standing long jump. Journal of National Taiwan Sports University, 19(1), (NSC H )
Zhu, W., Updyke, W. F., & Lewandowski, C. (1997). Post-Hoc Rasch analysis of optimal categorization of an ordered-response scale. Journal of Outcome Measurement, 1(4),
Zhu, W., & Kang, S. J. (1998). Cross-cultural stability of the optimal categorization of a self-efficacy scale: A Rasch analysis. Measurement in Physical Education and Exercise Science, 2,
Zhu, W., Timm, G., & Ainsworth, B. (2001). Rasch calibration and optimal categorization of an instrument measuring women's exercise perseverance and barriers. Research Quarterly for Exercise and Sport, 72,
Zhu, W. (2002). A confirmatory study of Rasch-based optimal categorization of a rating scale. Journal of Applied Measurement, 3,

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study RATER EFFECTS AND ALIGNMENT 1 Modeling Rater Effects in a Formative Mathematics Alignment Study An integrated assessment system considers the alignment of both summative and formative assessments with

More information

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS Farzilnizam AHMAD a, Raymond HOLT a and Brian HENSON a a Institute Design, Robotic & Optimizations (IDRO), School of Mechanical

More information

Evaluating and restructuring a new faculty survey: Measuring perceptions related to research, service, and teaching

Evaluating and restructuring a new faculty survey: Measuring perceptions related to research, service, and teaching Evaluating and restructuring a new faculty survey: Measuring perceptions related to research, service, and teaching Kelly D. Bradley 1, Linda Worley, Jessica D. Cunningham, and Jeffery P. Bieber University

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

Measurement issues in the use of rating scale instruments in learning environment research

Measurement issues in the use of rating scale instruments in learning environment research Cav07156 Measurement issues in the use of rating scale instruments in learning environment research Associate Professor Robert Cavanagh (PhD) Curtin University of Technology Perth, Western Australia Address

More information

Measuring the External Factors Related to Young Alumni Giving to Higher Education. J. Travis McDearmon, University of Kentucky

Measuring the External Factors Related to Young Alumni Giving to Higher Education. J. Travis McDearmon, University of Kentucky Measuring the External Factors Related to Young Alumni Giving to Higher Education Kathryn Shirley Akers 1, University of Kentucky J. Travis McDearmon, University of Kentucky 1 1 Please use Kathryn Akers

More information

Construct Validity of Mathematics Test Items Using the Rasch Model

Construct Validity of Mathematics Test Items Using the Rasch Model Construct Validity of Mathematics Test Items Using the Rasch Model ALIYU, R.TAIWO Department of Guidance and Counselling (Measurement and Evaluation Units) Faculty of Education, Delta State University,

More information

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Background of problem in assessment for elderly Key feature of CCAS Structural Framework of CCAS Methodology Result

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Psychometric properties of the PsychoSomatic Problems scale an examination using the Rasch model

Psychometric properties of the PsychoSomatic Problems scale an examination using the Rasch model Psychometric properties of the PsychoSomatic Problems scale an examination using the Rasch model Curt Hagquist Karlstad University, Karlstad, Sweden Address: Karlstad University SE-651 88 Karlstad Sweden

More information

Validation of an Analytic Rating Scale for Writing: A Rasch Modeling Approach

Validation of an Analytic Rating Scale for Writing: A Rasch Modeling Approach Tabaran Institute of Higher Education ISSN 2251-7324 Iranian Journal of Language Testing Vol. 3, No. 1, March 2013 Received: Feb14, 2013 Accepted: March 7, 2013 Validation of an Analytic Rating Scale for

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Author s response to reviews

Author s response to reviews Author s response to reviews Title: The validity of a professional competence tool for physiotherapy students in simulationbased clinical education: a Rasch analysis Authors: Belinda Judd (belinda.judd@sydney.edu.au)

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis Russell W. Smith Susan L. Davis-Becker Alpine Testing Solutions Paper presented at the annual conference of the National

More information

The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign

The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign Reed Larson 2 University of Illinois, Urbana-Champaign February 28,

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Vol. 9, Issue 5, 2016 The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Kenneth D. Royal 1 Survey Practice 10.29115/SP-2016-0027 Sep 01, 2016 Tags: bias, item

More information

Comparing standard toughness through weighted and unweighted scores by three standard setting procedures

Comparing standard toughness through weighted and unweighted scores by three standard setting procedures Comparing standard toughness through weighted and unweighted scores by three standard setting procedures Abstract Tsai-Wei Huang National Chiayi University, Taiwan Ayres G. D Costa The Ohio State University

More information

Maximizing the Accuracy of Multiple Regression Models using UniODA: Regression Away From the Mean

Maximizing the Accuracy of Multiple Regression Models using UniODA: Regression Away From the Mean Maximizing the Accuracy of Multiple Regression Models using UniODA: Regression Away From the Mean Paul R. Yarnold, Ph.D., Fred B. Bryant, Ph.D., and Robert C. Soltysik, M.S. Optimal Data Analysis, LLC

More information

The following is an example from the CCRSA:

The following is an example from the CCRSA: Communication skills and the confidence to utilize those skills substantially impact the quality of life of individuals with aphasia, who are prone to isolation and exclusion given their difficulty with

More information

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2010 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-20-2010 Construct Invariance of the Survey

More information

THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS

THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS THE COURSE EXPERIENCE QUESTIONNAIRE: A RASCH MEASUREMENT MODEL ANALYSIS Russell F. Waugh Edith Cowan University Key words: attitudes, graduates, university, measurement Running head: COURSE EXPERIENCE

More information

The outcome of cataract surgery measured with the Catquest-9SF

The outcome of cataract surgery measured with the Catquest-9SF The outcome of cataract surgery measured with the Catquest-9SF Mats Lundstrom, 1 Anders Behndig, 2 Maria Kugelberg, 3 Per Montan, 3 Ulf Stenevi 4 and Konrad Pesudovs 5 1 EyeNet Sweden, Blekinge Hospital,

More information

METHODS. Participants

METHODS. Participants INTRODUCTION Stroke is one of the most prevalent, debilitating and costly chronic diseases in the United States (ASA, 2003). A common consequence of stroke is aphasia, a disorder that often results in

More information

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT Evaluating and Restructuring Science Assessments: An Example Measuring Student s Conceptual Understanding of Heat Kelly D. Bradley, Jessica D. Cunningham

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University

More information

By Hui Bian Office for Faculty Excellence

By Hui Bian Office for Faculty Excellence By Hui Bian Office for Faculty Excellence 1 Email: bianh@ecu.edu Phone: 328-5428 Location: 1001 Joyner Library, room 1006 Office hours: 8:00am-5:00pm, Monday-Friday 2 Educational tests and regular surveys

More information

Eye Movements, Strabismus, Amblyopia, and Neuro-Ophthalmology. Evaluation of the Adult Strabismus-20 (AS-20) Questionnaire Using Rasch Analysis

Eye Movements, Strabismus, Amblyopia, and Neuro-Ophthalmology. Evaluation of the Adult Strabismus-20 (AS-20) Questionnaire Using Rasch Analysis Eye Movements, Strabismus, Amblyopia, and Neuro-Ophthalmology Evaluation of the Adult Strabismus-20 (AS-20) Questionnaire Using Rasch Analysis David A. Leske, Sarah R. Hatt, Laura Liebermann, and Jonathan

More information

APSYCHOMETRIC STUDY OF THE MODEL OF HUMAN OCCUPATION SCREENING TOOL (MOHOST)

APSYCHOMETRIC STUDY OF THE MODEL OF HUMAN OCCUPATION SCREENING TOOL (MOHOST) HKJOT 2010;20(2):63 70 ORIGINAL ARTICLE APSYCHOMETRIC STUDY OF THE MODEL OF HUMAN OCCUPATION SCREENING TOOL (MOHOST) Gary Kielhofner 1, Chia-Wei Fan 2, Mary Morley 3, Mike Garnham 4, David Heasman 3, Kirsty

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Running head: PRELIM KSVS SCALES 1

Running head: PRELIM KSVS SCALES 1 Running head: PRELIM KSVS SCALES 1 Psychometric Examination of a Risk Perception Scale for Evaluation Anthony P. Setari*, Kelly D. Bradley*, Marjorie L. Stanek**, & Shannon O. Sampson* *University of Kentucky

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven Introduction The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven outcome measure that was developed to address the growing need for an ecologically valid functional communication

More information

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive

More information

PTHP 7101 Research 1 Chapter Assignments

PTHP 7101 Research 1 Chapter Assignments PTHP 7101 Research 1 Chapter Assignments INSTRUCTIONS: Go over the questions/pointers pertaining to the chapters and turn in a hard copy of your answers at the beginning of class (on the day that it is

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

The validity of polytomous items in the Rasch model The role of statistical evidence of the threshold order

The validity of polytomous items in the Rasch model The role of statistical evidence of the threshold order Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 377-395 The validity of polytomous items in the Rasch model The role of statistical evidence of the threshold order Thomas Salzberger 1

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

The Use of Rasch Wright Map in Assessing Conceptual Understanding of Electricity

The Use of Rasch Wright Map in Assessing Conceptual Understanding of Electricity Pertanika J. Soc. Sci. & Hum. 25 (S): 81-88 (2017) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ The Use of Rasch Wright Map in Assessing Conceptual Understanding of Electricity

More information

Validation the Measures of Self-Directed Learning: Evidence from Confirmatory Factor Analysis and Multidimensional Item Response Analysis

Validation the Measures of Self-Directed Learning: Evidence from Confirmatory Factor Analysis and Multidimensional Item Response Analysis Doi:10.5901/mjss.2015.v6n4p579 Abstract Validation the Measures of Self-Directed Learning: Evidence from Confirmatory Factor Analysis and Multidimensional Item Response Analysis Chaiwichit Chianchana Faculty

More information

INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT. Basic Concepts, Parameters and Statistics

INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT. Basic Concepts, Parameters and Statistics INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT Basic Concepts, Parameters and Statistics The designations employed and the presentation of material in this information product

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Answers to end of chapter questions

Answers to end of chapter questions Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are

More information

Model fit and robustness? - A critical look at the foundation of the PISA project

Model fit and robustness? - A critical look at the foundation of the PISA project Model fit and robustness? - A critical look at the foundation of the PISA project Svend Kreiner, Dept. of Biostatistics, Univ. of Copenhagen TOC The PISA project and PISA data PISA methodology Rasch item

More information

Evaluation of the Short-Form Health Survey (SF-36) Using the Rasch Model

Evaluation of the Short-Form Health Survey (SF-36) Using the Rasch Model American Journal of Public Health Research, 2015, Vol. 3, No. 4, 136-147 Available online at http://pubs.sciepub.com/ajphr/3/4/3 Science and Education Publishing DOI:10.12691/ajphr-3-4-3 Evaluation of

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2016 Students' perceived understanding and competency

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

Measurement Issues in Concussion Testing

Measurement Issues in Concussion Testing EVIDENCE-BASED MEDICINE Michael G. Dolan, MA, ATC, CSCS, Column Editor Measurement Issues in Concussion Testing Brian G. Ragan, PhD, ATC University of Northern Iowa Minsoo Kang, PhD Middle Tennessee State

More information

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology*

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Timothy Teo & Chwee Beng Lee Nanyang Technology University Singapore This

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology ISC- GRADE XI HUMANITIES (2018-19) PSYCHOLOGY Chapter 2- Methods of Psychology OUTLINE OF THE CHAPTER (i) Scientific Methods in Psychology -observation, case study, surveys, psychological tests, experimentation

More information

On the Construct Validity of an Analytic Rating Scale for Speaking Assessment

On the Construct Validity of an Analytic Rating Scale for Speaking Assessment On the Construct Validity of an Analytic Rating Scale for Speaking Assessment Chunguang Tian 1,2,* 1 Foreign Languages Department, Binzhou University, Binzhou, P.R. China 2 English Education Department,

More information

COMPARISON OF DIFFERENT SCALING METHODS FOR EVALUATING FACTORS IMPACT STUDENTS ACADEMIC GROWTH

COMPARISON OF DIFFERENT SCALING METHODS FOR EVALUATING FACTORS IMPACT STUDENTS ACADEMIC GROWTH International Journal of Innovative Management Information & Production ISME International c 2014 ISSN 2185-5455 Volume 5, Number 1, March 2014 PP. 62-72 COMPARISON OF DIFFERENT SCALING METHODS FOR EVALUATING

More information

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior

Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu

Interpersonal Citizenship Motivation: A Rating Scale Validity of Rasch Model Measurement

Shereen Noranee, Noormala Amir Ishak, Raja Munirah Raja Mustapha, Rozilah Abdul Aziz, and Rohana Mat Som Abstract

How Do We Assess Students in the Interpreting Examinations?

Fred S. Wu 1 Newcastle University, United Kingdom The field of assessment in interpreter training is under-researched, though trainers and researchers

Optimizing Rating Scale Category Effectiveness. John M. Linacre MESA Psychometric Laboratory University of Chicago

Journal of Applied Measurement 3:1 2002 p.85-106. Abstract Rating

A Comparison of Several Goodness-of-Fit Statistics

Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodness-of-fit procedures

Chapter 12: Introduction to Analysis of Variance

Chapter 12 presents the general logic and basic formulas for the hypothesis testing procedure known as analysis of variance (ANOVA). The purpose

Implicit Information in Directionality of Verbal Probability Expressions

Hidehito Honda (hito@ky.hum.titech.ac.jp) Kimihiko Yamagishi (kimihiko@ky.hum.titech.ac.jp) Graduate School of Decision Science

Chapter 3. Psychometric Properties

Reliability The reliability of an assessment tool like the DECA-C is defined as, the consistency of scores obtained by the same person when reexamined with the same test

Rasch Model Analysis On Teachers Epistemological Beliefs

Amar Ma'ruf & Mohamed Najib Abdul Ghafar & Samah Ali Mohsen Mofreh Abstract Epistemological Beliefs are fundamental assumptions about the nature

Chapter 9. Youth Counseling Impact Scale (YCIS)

Background Purpose The Youth Counseling Impact Scale (YCIS) is a measure of perceived effectiveness of a specific counseling session. In general, measures

Approaches for the Development and Validation of Criterion-referenced Standards in the Korean Health Literacy Scale for Diabetes Mellitus (KHLS-DM)

Kang Soo-Jin, RN, PhD, Assistant Professor Daegu University,

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Readings: OpenStax Textbook - Chapters 1-5 (online) Appendix D & E (online) Plous - Chapters 1, 5, 6, 13 (online) Introductory comments Describe how familiarity with statistical methods can - be associated

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

USE AND MISUSE OF MIXED MODEL ANALYSIS OF VARIANCE IN ECOLOGICAL STUDIES

Ecology, 75(3), 1994, pp. 717-722 © 1994 by the Ecological Society of America CYNTHIA C. BENNINGTON Department of Biology, West

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

This course does not cover how to perform statistical tests on SPSS or any other computer program. There are several courses

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

TECHNICAL REPORT. Contents: Executive Summary; Introduction; Overview of Data Analysis Concepts

Selection of Linking Items

Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,

Latent Trait Standardization of the Benzodiazepine Dependence Self-Report Questionnaire using the Rasch Scaling Model

Chapter 7 C.C. Kan 1, A.H.G.S. van der Ven 2, M.H.M. Breteler 3 and F.G. Zitman 1 1

You must answer question 1.

Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose

Power of the test of One-Way Anova after transforming with large sample size data

Available online at www.sciencedirect.com Procedia Social and Behavioral Sciences 9 (2010) 933-937 WCLTA-2010 Natcha Mahapoonyanont

Teaching A Way of Implementing Statistical Methods for Ordinal Data to Researchers

Journal of Mathematics and System Science (01) 8-1 DAVID PUBLISHING Elisabeth Svensson Department of Statistics, Örebro

THE FIRST VALIDITY OF SHARED MEDICAL DECISIONMAKING QUESTIONNAIRE IN TAIWAN

Chi-CHANG CHANG 1 1 School of Medical Informatics, Chung Shan Medical University, and Information Technology Office of Chung

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Kate DeRoche, M.A. Mental Health Center of Denver Antonio Olmos, Ph.D. Mental Health

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

The Western Aphasia Battery (WAB) (Kertesz, 1982) is used to classify aphasia by classical type, measure overall severity, and measure change over time. Despite its near-ubiquitousness, it has significant

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi'an Jiaotong University.

Running head: ASSESS MEASUREMENT INVARIANCE Xiaowen Zhu Xi'an Jiaotong University Yanjie Bian Xi'an Jiaotong

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 SUSAN E. WHITELY AND RENE V. DAWIS University of Minnesota Although it has been claimed that

Reliability, validity, and all that jazz

Dylan Wiliam King's College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to

Development of the Mental, Emotional, and Bodily Toughness Inventory in Collegiate Athletes and Nonathletes

Journal of Athletic Training 2008;43(2):125-132 © by the National Athletic Trainers' Association, Inc www.nata.org/jat original research

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT) AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational

Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison

Ethan D. Montag Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science Rochester Institute

THE INTERPRETATION OF EFFECT SIZE IN PUBLISHED ARTICLES. Rink Hoekstra University of Groningen, The Netherlands

Rink Hoekstra University of Groningen, The Netherlands R.@rug.nl Significance testing has been criticized, among others, for encouraging researchers to focus

Chapter 2--Norms and Basic Statistics for Testing

Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.

Validity and Reliability of the Malaysian Creativity and Innovation Instrument (MyCrIn) using the Rasch Measurement Model

SITI RAHAYAH ARIFFIN, FARHANA AHMAD KATRAN, AYESHA ABDULLAH NAJIEB BADIB & NUR AIDAH

Low Tolerance Long Duration (LTLD) Stroke Demonstration Project

Interim Summary Report October 25 Table of Contents 1. INTRODUCTION 1.1 Background 2. APPROACH 2.1 LTLD Stroke Demonstration Project

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Wai Kwan

(C) GoldCal LLC DBA GoldSRD

ROOT CAUSE ANALYSIS Danny M. Goldberg, Founder Root Cause Analysis Problem or Unwanted Event Occurrence Symptoms Problem or Unwanted Event Recurrence Apparent Cause Root Cause Prevent Wilson, Dell, and

Conceptualising computerized adaptive testing for measurement of latent variables associated with physical objects

Journal of Physics: Conference Series OPEN ACCESS Recent citations - Adaptive Measurement

RASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN

International Journal of Arts & Sciences, CD-ROM. ISSN: 1944-6934 :: 08(03):107-150 (2015) Enrico Gori University of Udine, Italy

Measuring change in training programs: An empirical illustration

Psychology Science Quarterly, Volume 50, 2008 (3), pp. 433-447 RENATO MICELI 1, MICHELE SETTANNI 1 & GIULIO VIDOTTO 2 Abstract The implementation

Over the years, researchers have been concerned about the possibility that self-report

A Psychometric Investigation of the Marlowe Crowne Social Desirability Scale Using Rasch Measurement Hyunsoo Seol The author used Rasch measurement to examine the reliability and validity of 382 Korean

The Effect of Option Homogeneity in Multiple-Choice Items

Manuscripts Applied Psychological Measurement 1-12 © The Author(s) 2018 Reprints and permissions: sagepub.com/journalspermissions.nav DOI: 10.1177/0146621618770803

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

A Brief (very brief) Overview of Biostatistics. Jody Kreiman, PhD Bureau of Glottal Affairs

What We'll Cover Fundamentals of measurement Parametric versus nonparametric tests Descriptive versus inferential

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology

7.1 Introduction This chapter addresses the research design and describes the research methodology employed in this study. The sample and sampling procedure is