Optimizing distribution of Rating Scale Category in Rasch Model
Han-Dau Yau, Graduate Institute of Sports Training Science, National Taiwan Sport University, Taiwan
Wei-Che Yao, Department of Statistics, National Taipei University, Taiwan

Presented at the 76th Annual and the 17th International Meeting of the Psychometric Society, The Hong Kong Institute of Education, Hong Kong, July 19-22, 2011.

Please address correspondence concerning this manuscript to: Han-Dau Yau, Graduate Institute of Sports Training Science, College of Sports and Athletics, National Taiwan Sport University, 250, Wen Hua 1st Rd., Kueishan, Taoyuan County, Taiwan.

Grant funding information: The research project was supported by grants from the National Science Council (NSC H ), Taiwan. The presentation was supported by grants from the National Science Council (NSC I A1), Taiwan.
Abstract

Traditionally, the categories of sports skills tests have been set by subjective methods. Yau (2010) developed a new objective method of category setting, but did not specify which theoretical distribution should underlie it. The purpose of this study was to identify the optimal distribution of rating scale categories in the Rasch model. The distributions studied were the normal, logistic, binomial, and uniform distributions. SAS and Minitab were used to produce random data, and Winsteps was then used to estimate the categories. The experiment used a two-way design (sample size × test length), with five simulations for each cell. The results were: 1. The normal distribution was the ideal distribution when the response data exceeded 3000. 2. The logistic distribution was the better distribution when the response data numbered fewer than 1500. The study concluded that the optimal distribution of rating scale categories in the Rasch model is related to sample size.

Key words: Rasch measurement, normal distribution, logistic distribution, binomial distribution, uniform distribution.
Introduction

Rasch measurement has been recommended for constructing sport skills tests. The original design of such tests was based on subjective experience, and we found that, with the category setting of the scale, the response information from subjects was insufficient for sport skills tests. This appeared as the problem of disordered categories. Therefore, Yau (2010) developed an objective and reasonable approach to category setting, but did not clarify which theoretical distribution it relied on. The present work is a simulation study of category setting for scales in sport skills tests, seeking the theoretical distribution best suited to the categories of the Rasch rating scale model. It extends the approach to category setting in sport skill testing developed for Multiple-Attempt Single-Item tests (Yau, 2010).

In test development practice, item difficulty often fails to fit the ability of the subjects. In particular, tests constructed with the Rasch rating scale model frequently exhibit disordered categories. The remedy is to revise the scoring standards, but how those standards should be divided among the category ranges requires a reasonable basis for estimation. Therefore, this study attempts to explore the best categories for the Rasch rating scale based on probability distribution theory.

The study of category disorder originated from the results of Rasch model estimation. The published literature began with Linacre (1991), "Step Disordering and Rasch Thurstone Thresholds," which concluded: "Rasch-Thurstone-type Thresholds provide the best estimate of the transitions between categories of the rating scale. Step disorder is of no concern, provided that the structure of the scale is conceptually sound." That article treated the disordering of rating scale categories as a problem that returns to "Rasch-Thurstone-type threshold" theory.
When a person's ability estimate reaches a higher level, the person should have crossed the higher threshold; a person who has reached a higher threshold has already passed the lower thresholds. Overall, that paper did not analyze category thresholds from actual response estimates but described them only from theoretical sources. The rating scale model was originally extended from the dichotomous Rasch model, and the dichotomous Rasch model has no disordering problem because it involves only the level of item difficulty.

Previous research on disordered categories is reviewed below.

1. The reasons for step difficulty disorder. Shaw, Wright and Linacre (1992) pointed out: "Each category represents greater success than the previous category. Observing category 2, however, is unlikely when compared to categories 1 or 3. Consequently step difficulties will be disordered. The step from 2 to 3 will be easier than from 1 to 2. This has nothing to do with the difficulty of the tasks. It is determined entirely by our peculiar specification of the rating scale, a specification that David would criticize." By definition the categories should be arranged in order, but the observed response frequencies may produce disordered category estimates. It
showed that the frequency of a category was too small, making its probability low, so that the category could not show the highest probability at any ability level. In sports skills tests, the problem is an inappropriate setting of the category ranges, so that item difficulty and subject ability do not fit. Which theoretical distribution should guide the adjustment of category ranges needs to be clarified by further research.

2. The reliability and validity of Rasch measurement. First, Stone and Wright (1994) found that person separation was best with the categories grouped into three levels from the original five-category scale. Second, Bond and Fox (2007), in their "guidelines for collapsing categories", proposed: "The rating scale diagnostics help us in determining the best categorization, knowledge of Rasch reliability and validity indices tell us how the measure is functioning as a whole" (p. 231). In other words, Bond and Fox held that the best categorization improves test reliability and validity. In recent years, some studies of disordered categories have confirmed the effect on reliability and validity. For example, Tennant (2004) reported on threshold disorder in health care, which affected scale validity in the Functional Independence Measure.

3. Solutions to disordered categories. On treating disorder, Andrich (1996) suggested that categories 2 and 3 should be combined, and categories 4 and 5 likewise. This may solve the disorder problem, but it in fact abandons a category. Yau, Chi, Zhou and Yao (2008) revised the standing long jump observation checklist, using the method of collapsing categories to treat the disordered categories. Although the revised checklist fit the theory, it reduced the number of categories and, with them, the discriminating power and validity of the standing long jump observation checklist.
In addition, Pesudovs, Gothwal, Wright, and Lamoureux (2010) found a problem with treating disordered categories by combination in the National Eye Institute Visual Function Questionnaire: "Because category 3 is a neutral category, it did not seem logical to combine it with an adjacent category." Another solution was proposed by Siegert, Jackson, Tennant, and Turner-Stokes (2010), who set out five steps of Rasch analysis; the main method was stepwise deletion of the worst-fitting item. There is also a new procedure for dealing with disordered categories. Van Lente, Karabatsos, and Uekawa (2010) reviewed the traditional procedure, "the particular recategorization of the rating scale that eliminates inconsistencies. However, there is no clear consensus as to which method is best." They created a new approach (2010): "This study introduces a sample-free method of rating scale optimization, based on the bootstrap, which addresses the issues just mentioned. The bootstrap is a general statistical procedure that simulates the population distribution by resampling from the original (sample) data set with replacement." Finally, Linacre (2010) advocated: "This paper suggests that transition categories should no longer be viewed as threats to valid measurement, but rather as an integral and increasingly important part of the advance of social science. Consequently, social-science measurement must improve the analysis and communication of transitional categories rather than attempt to eliminate them."

Based on the above, the disorder of
threshold values was caused by a lack of clearly defined categories, which subjects were unable to distinguish. The solutions were to combine categories or to delete the items with disordered thresholds. If categories are combined, the combination must also fit the logic of the responses. Both methods, however, reduce categories or delete items; they are temporary measures rather than fundamental ways to remedy disordered categories. Finally, Linacre (2010) called for improved measurement analysis and clear communication of categories, which offers a good strategy for future treatment of threshold disorder. These ideas are consistent with this study's development of category setting methods in sport skill testing.

4. Optimizing rating scale categories. Linacre (2001) pointed out: "When the thresholds are Rasch-Andrich thresholds or step calibrations, then disordering occurs when some categories never become modal, i.e., they are not observed frequently enough. This implies that they correspond to intervals on the latent variable narrower than about 1 logit in terms of Rasch-Thurstone thresholds." That study pointed out that category disorder was caused by too few observations in a category, and proposed that disordering occurs when the threshold interval is narrower than about 1.0 logit. Its contribution was to present an objective criterion for judging poor categories. Linacre (2002) then investigated the appropriate range of categories and set out a number of guidelines:

Preliminary Guideline: All items oriented with latent variable.
Guideline 1: At least 10 observations of each category.
Guideline 2: Regular observation distribution.
Guideline 3: Average measures advance monotonically with category.
Guideline 4: OUTFIT mean-squares less than 2.0.
Guideline 5: Step calibrations advance.
Guideline 6: Ratings imply measures, and measures imply ratings.
Guideline 7: Step difficulties advance by at least 1.4 logits.
Guideline 8: Step difficulties advance by less than 5.0 logits.

These guidelines for appropriate categories have worked well in practice. When the step difficulty advanced by less than 1.4 logits, the categories became disordered; an example is the study of the standing long jump observation checklist by Yau et al. (2008).

5. Studies of disordered categories in sport psychology surveys. For a self-efficacy scale, Zhu, Updyke and Lewandowski (1997) used Rasch analysis to study the optimal categorization and reported: (1) "It was found that, instead of the five-category construct designed, the best order of category meanings of the scale in respondents' perceptions was a three-category construct." (2) Post-hoc Rasch analysis could be used in determining the optimal categorization of an ordered-response scale. That study also combined categories: the categories were reduced from five to three, and category fit improved. Zhu and Kang (1998) found that "instead of the 5-category construct designed, the optimal order of category meanings of the scale in respondents' perceptions was a 3-category construct. It is also found that the Rasch threshold estimates and separation statistics continuously played critical roles in determining the optimal categorization." Zhu, Timm and Ainsworth (2001) studied the Rasch calibration and optimal categorization of an instrument measuring women's exercise perseverance and barriers. They found: Instead of
the original five-category construct, which had a disordered internal construct, a collapsed three-category construct (i.e., Very Often/Often, Sometimes/Rarely, and Never) was found to have the optimal categorization. Zhu (2002) found that the scale was modified from its original five-category structure ("Very often" = 1, "Often" = 2, "Sometimes" = 3, "Rarely" = 4, and "Never" = 5) to a three-category structure ("Very Often" = 1, "Sometimes" = 2, and "Never" = 3). These results verified that the characteristics of the optimal categorization identified by the Rasch post-hoc analysis can be maintained after the original scale was modified based on such an analysis. Myers, Feltz, and Wolfe (2008) conducted a confirmatory study of rating scale category effectiveness for the coaching efficacy scale. They discussed the structure of optimal categorization and described two approaches: (1) an exploratory study identifies an effective categorization structure through exploratory post hoc methods; (2) a confirmatory study confirms the effectiveness of this categorization structure in a subsequent study. These were studies of psychological perception, in which category setting belongs to the semantic field; they are not relevant to dividing specific score ranges in sport skill testing, where reasonable category setting for subjects is different.

6. Studies of step disorder in sports testing. Chou and Yau (2006) evaluated the Standing Long Jump Observation Checklist (SLJOC) for assessing developmental level: "The average measure of the category 2 of the item 3 was [...]. This disorder in the average measures of categories might imply the disorder in the category definitions." Chou and Yau focused on evaluation, so they only reported the disorder and attributed it to the category definitions, without trying to solve the problem. Yau et al.
(2008) revised the SLJOC for assessing developmental level. To address the disorder found in estimation, they combined the second and third categories of the second item, and the first and second categories of both the third and the sixth items. The revised estimation yielded a better data-model fit. The solution in Yau et al. was the category-combination tactic of Bond and Fox (2007). From another point of view, this tactic is a poor way to solve the problem: if categories are combined repeatedly, discrimination is reduced, which defeats the purpose of category setting. In addition, Chen (2008), in constructing a forehand drive test for first-level table tennis athletes, sought the proper category setting by testing and revising three times, but lacked the support of theory and concrete data; the study could only show, by reference to the software estimates, that there was no disorder problem, so that the Rasch model could be used to analyze the data. Clearly, a sound method of category setting is needed that solves the problem thoroughly. Yau (2010) then developed a reasonable category setting approach for sport skill testing, but could not determine which theoretical distribution was better to use. A proper theoretical distribution for the categories is the key to successful measurement, so establishing the theoretical distribution of rating scale categories in the Rasch model is the remedy for the disorder problem.

Given the difficulties identified in the literature, there is a clear need to develop a category
setting method for sports skill testing. When the Rasch model is used to construct tests, there is no practical category setting method concerning the distribution ratio of rating scale categories. If estimation yields disordered categories, only unsatisfactory methods are available. The fundamental and better way is therefore to examine the theoretical distribution of categories in test construction, so that a reasonable theoretical distribution can be established. The purpose of this study is thus to identify, by simulation, the best theoretical distribution of rating scale categories in the Rasch model.

Methods

Objects of study

The normal, binomial, and logistic distributions are the theoretical distributions related to rating scale categories in the Rasch model. The objects of this study were these distributions together with the uniform distribution, which Linacre (2002) associated with proper categories. First, regarding the logistic distribution, Rasch (1960) followed students' reading progress over time and sought a common measure of reading improvement that could serve as a tool for estimating reading ability accurately. That research began by using a Poisson model for the number of reading errors and defined ability and difficulty separately (a person p of ability B_p, a text t of difficulty D_t); natural logarithms were used, and the cumulative function was the logistic distribution. Second, the dichotomous model follows a Bernoulli distribution, and the rating scale consists of ordinal categories that follow a binomial distribution, so the binomial distribution was included in the study. As for the normal distribution, practical data are usually normally distributed because the characteristics of the population are normally distributed. The normal distribution also approximates both the binomial distribution and the Poisson distribution (Ho & Yau, 1997).
In addition, most statistical problems are usually solved with the normal distribution, so the normal distribution was an object of this study. Finally, regarding the proper category range from Linacre (2002), Guideline 2 concerns regular observation distribution: "Irregularity in observation frequency across categories may signal aberrant category usage. A uniform distribution of observations across categories is optimal for step calibration." Therefore, the uniform distribution was included in this study.

Procedures

This study examined theoretical category distributions by simulation: statistical software was used to simulate data, and the categories were then estimated and verified under each theoretical distribution. The following describes the data simulation for the normal, logistic, binomial, and uniform distributions, and the evaluation of the quality of category estimation under the different theoretical distributions.
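As an illustration only, the data-generation step can be sketched in Python. The study itself used SAS and Minitab, and the source does not state how continuous draws were mapped to the six categories scored 0 to 5. The cut points, the binomial parameters (n = 5, p = 0.5), and all function names below are therefore assumptions made for this sketch, not the authors' settings.

```python
import math
import random

# ASSUMPTION: cut points (in standard units) that split a continuous draw
# into six ordered categories 0..5. The source does not give its cut points.
CUTS = (-2.0, -1.0, 0.0, 1.0, 2.0)


def to_category(value, cuts=CUTS):
    """Return how many cut points lie below value, i.e. a score 0..len(cuts)."""
    return sum(value > c for c in cuts)


def draw_category(dist, rng):
    """Draw one category score (0..5) under the named distribution."""
    if dist == "normal":
        return to_category(rng.gauss(0.0, 1.0))
    if dist == "logistic":
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return to_category(math.log(u / (1.0 - u)))  # inverse logistic CDF
    if dist == "binomial":
        # ASSUMPTION: Binomial(5, 0.5), the number of successes in five trials.
        return sum(rng.random() < 0.5 for _ in range(5))
    if dist == "uniform":
        return rng.randrange(6)  # each of 0..5 equally likely
    raise ValueError(f"unknown distribution: {dist}")


def simulate(dist, n_persons, n_items, seed=0):
    """Return an n_persons x n_items matrix of category scores."""
    rng = random.Random(seed)
    return [[draw_category(dist, rng) for _ in range(n_items)]
            for _ in range(n_persons)]


# Example: one small-sample, short-form data set under the normal distribution.
small_short = simulate("normal", n_persons=30, n_items=10, seed=1)
```

Fixing the seed makes each simulated data set reproducible, which helps when the same data are later passed to a Rasch estimation program.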
Data

We used the normal, binomial, and ranuni functions in SAS to produce random data following the normal, binomial, and uniform distributions, and saved them as SAS permanent files (file extension sas7bdat). We then used the logistic function in Minitab 14 to produce random data following the logistic distribution, saved them as a Minitab 14 file, and transferred them directly to an Excel file.

Rasch Analysis

Winsteps read the SAS permanent files and the Excel file to build the Winsteps program and data files; Winsteps then estimated the parameters and evaluated the data-model fit of the simulated data.

Design

There were four factors in the simulation study: distribution theory (DT), sampling frequency (SF), sample size (SP), and test length (TL). Distribution theory was determined by the Rasch measurement model; the appropriate levels of the other three factors had to be decided first, and generalizability theory was used to judge them. In the pilot study, we defined five random samplings (S), each drawing 30 samples (P), with a test length of ten items (I). The analysis model of generalizability theory was (P:S) × I, meaning that P was nested within S and (P:S) was crossed with I. Table 1 shows the results of the G-study.

Table 1 Results of Generalizability Study

Source of      Percent of Variance Components
Variation      Binomial   Logistic   Normal    Uniform   Mean
S              0.00%      0.00%      0.00%     0.00%     0.00%
P:S            0.75%      2.99%      0.00%     0.00%     0.94%
I              0.85%      0.00%      0.05%     0.02%     0.23%
SI             0.55%      1.02%      0.00%     0.00%     0.39%
PI:S           97.84%     95.99%     99.95%    99.98%    98.44%
Total          %          %          %         %         %

Notes: P: sample size. S: sampling frequency. I: test length.

The generalizability study gave the percentage of variance components for each source of variation (SV). The mean for random error (PI:S) was 98.44%, which accounted for most of the variance, whereas each of the other SVs accounted for less than 1%.
This showed that the simulated random sampling data of this study fit random sampling theory. Next, the mean SV for sample size nested within sampling frequency (P:S) was 0.94%, the second highest value; the mean SV for sampling frequency crossed with test length (SI) was 0.39%, the third highest; and the mean SV for test length (I) was 0.23%, the fourth highest. Finally, the mean SV for sampling frequency (S) was approximately zero and was of no concern.
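The Mean column of Table 1 and the ranking of the sources of variation can be checked with a few lines of Python. The percentages are transcribed from Table 1; the function names are hypothetical, and the sketch only recomputes row means rather than re-running the generalizability study.

```python
# Percent of variance components from Table 1, in the column order
# Binomial, Logistic, Normal, Uniform.
TABLE1 = {
    "S":    [0.00, 0.00, 0.00, 0.00],
    "P:S":  [0.75, 2.99, 0.00, 0.00],
    "I":    [0.85, 0.00, 0.05, 0.02],
    "SI":   [0.55, 1.02, 0.00, 0.00],
    "PI:S": [97.84, 95.99, 99.95, 99.98],
}


def mean_percent(table):
    """Average each source of variation across the four distributions."""
    return {source: sum(values) / len(values) for source, values in table.items()}


def rank_sources(means):
    """Order the sources of variation from largest to smallest mean percent."""
    return sorted(means, key=means.get, reverse=True)


means = mean_percent(TABLE1)
ranking = rank_sources(means)
for source in ranking:
    print(f"{source:5s} {means[source]:6.2f}%")
```

The resulting order (PI:S, then P:S, SI, I, and S) matches the ordering described in the text.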
Based on the results of the generalizability study, the mean source of variation for sample size had the second largest percentage, so we set a large sample (N=1000), a medium sample (N=300), and a small sample (N=30) for each distribution of interest. The mean SV for test length had the fourth largest percentage, so we defined a short form (I=10) and a long form (I=50). The mean SV for sampling frequency approached zero, so we drew random samples five times. For the study as a whole, each combination was simulated with five random samplings, producing 120 sets of simulation data. The simulation data had six categories, scored 0 to 5, and we compared the accuracy of parameter estimation, the category thresholds, and the categories scored. The comparison showed the accuracy of parameter estimation, how often and at which category thresholds disorder occurred, and which statistical distribution the data followed. In addition to the criteria suggested by Linacre (2002), this study proposed its own judgment indexes.

Analysis

This study used Rasch analysis in Winsteps to evaluate the fit of the random data produced under the different distribution theories and the quality of the categories. From the estimation we obtained the measures of items, subjects, and categories, together with the estimated standard errors and the INFIT and OUTFIT statistics, which convey the information of the test (reliability). The fit of the data (items, subjects) to the model supports the validity of the test. Most important was to evaluate the random data produced by each theoretical distribution in terms of category frequency, percentage, step, and threshold. We used the criteria for reasonable categories (Linacre, 2002) to evaluate the Rasch rating scale model.

Results

The results first address data-model fit and estimation accuracy.
Then we analyzed the counts of ideal step difficulty, the counts of category observations, and the disordering of step calibrations in category estimation. Finally, we drew on these results for a comprehensive discussion.

Category Fit Statistics

First, we analyzed data-model fit by Rasch measurement. When data fit the Rasch model, the measurement is valid and further inference is meaningful. Linacre (2002) Guideline 4, OUTFIT mean-squares less than 2.0, sets the minimum standard for category fit. The category fit of the four kinds of simulation data was estimated with the Rasch rating scale model. The results were as follows (Table 2): (1) For the normal distribution, the INFIT mean square is 0.997 and the standard deviation is [...]. For the logistic distribution, the INFIT mean square is 0.997 and the standard deviation is [...]. For the binomial distribution, the INFIT mean square is 0.999 and the standard deviation is [...]. For
the uniform distribution, the INFIT mean square is 0.997 and the standard deviation is [...]. (2) For the normal distribution, the OUTFIT mean square is 0.997 and the standard deviation is [...]. For the logistic distribution, the OUTFIT mean square is 0.997 and the standard deviation is [...]. For the binomial distribution, the OUTFIT mean square is 0.999 and the standard deviation is [...]. For the uniform distribution, the OUTFIT mean square is 0.996 and the standard deviation is [...]. (3) Comparing category fit across the four distribution theories by ANOVA, the F value for the INFIT mean square was [...] and the F value for the OUTFIT mean square was [...]; neither reached significance. Therefore the means of INFIT and OUTFIT approached 1 and satisfied Guideline 4, OUTFIT mean-squares less than 2.0 (Linacre, 2002). This means the estimated categories under the four distribution theories fit the Rasch model: the fit of category estimation was good, and the measurement had validity that supports valid inference.

Table 2 Fit Statistics of category in simulation data

Fit Statistics   Statistics   Normal   Logistic   Binomial   Uniform   F Value
INFIT MNSQ       Mean         0.997    0.997      0.999      0.997
                 SD
OUTFIT MNSQ      Mean         0.997    0.997      0.999      0.996
                 SD
* p < .05

Estimated standard error of categories

The accuracy of measurement estimation can be judged from the standard errors of category estimation. The results under the Rasch rating scale model were as follows (Table 3): (1) In the three-factor ANOVA, the interactions DT × SP × TL and DT × SP and the main effects DT, SP, and TL reached significance (p < .05).
(2) The ω² of distribution theory was 0.069, the ω² of sample size was 0.060, the ω² of test length was 0.044, the ω² of DT × SP was 0.004, and the ω² of DT × SP × TL was [...]. Keppel (1991) noted that in the behavioral sciences an effect is large when ω² is 0.15, medium when ω² is 0.06, and small when ω² is [...]. Therefore, in terms of effect size, distribution theory and sample size had medium effect sizes and test length had a small effect size. DT × SP and DT × SP × TL both had small effect sizes. (3) Post hoc comparison, by Scheffé test, of distribution theory, a source of variation with medium effect size, showed: (A) The estimated standard error of category estimation under Normal was [...], which was larger than under Binomial, under Uniform, and under Logistic. (B) The estimated standard error of category estimation under Binomial was [...], which was larger than under Logistic. Therefore the estimated standard error of category estimation under Normal was clearly larger than the others, and it was smallest under Logistic. This means the accuracy of category estimation was worst under Normal and best under Logistic. (4) The result of the post hoc comparison for sample size (medium effect size) by Scheffé
test was: (A) Under the small sample size the standard error of category estimation was [...], which was larger than under the medium sample size. (B) Under the medium sample size the standard error of category estimation was [...], which was larger than under the large sample size. Therefore the standard error of category estimation was affected by sample size: the larger the sample, the smaller the standard error. (5) The result of the post hoc comparison for test length (small effect size) by Scheffé test was: the standard error of category estimation under the short form was [...], which was larger than under the long form, showing that the standard error of category estimation was affected by test length. As test length increased, the standard error became smaller. (6) The effect sizes of DT × SP and DT × SP × TL were tiny, so further discussion is not meaningful even though the tests were significant.

Table 3 Summary of three-way ANOVA for estimated standard error in simulation data

Source of Variation    DF   Type III SS   MS   F Value   ω²
a (DT)                                         *
b (SP)                                         *
a*b (DT × SP)                                  *
c (TL)                                         *
a*c (DT × TL)
b*c (SP × TL)
a*b*c (DT × SP × TL)                           *
Error
Total
Notes: * p < .05. DT = distribution theory; SP = sample size; TL = test length.

Step difficulties

The proper category criteria of Linacre (2002) include Guideline 7: Step difficulties advance by at least 1.4 logits. According to this criterion, we tested by one-sample t test (μ = 1.4) whether the step difficulties of the simulation data produced under the four distribution theories fell in the range from 1.4 to 5.0 logits (Table 4).

Table 4 Summary of one sample t-test (μ=1.4) for estimated step difficulty

Distribution   N   Mean   SD   S Err   t Value
Normal                                 *
Logistic           1.06                *
Binomial           0.87                *
Uniform            0.03                *
* p < .05

The one-sample t test was used to test the step difficulty with the null hypothesis μ = 1.4, and all tests reached significance (p < .05): the mean of Normal is
[...], which is clearly larger than 1.4, while the means of Logistic, Binomial, and Uniform are 1.06, 0.87, and 0.03, all smaller than 1.4; only the normal distribution satisfied these two criteria. We then checked whether each observed value satisfied the two criteria. The evaluation gave the frequency and percentage of optimal step difficulties in the simulation data (Table 5).

Table 5 The frequency and percentage of optimizing step difficulty in simulation data

Sample size      Test length    Normal     Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    20(100%)   5(25%)     0(0%)      0(0%)     50000
large (N=1000)   short (I=10)   20(100%)   6(30%)     0(0%)      0(0%)     10000
medium (N=300)   long (I=50)    20(100%)   5(25%)     0(0%)      0(0%)     15000
medium (N=300)   short (I=10)   19(95%)    9(45%)     0(0%)      0(0%)     3000
small (N=30)     long (I=50)    16(80%)    9(45%)     0(0%)      0(0%)     1500
small (N=30)     short (I=10)   10(50%)    10(50%)    4(20%)     0(0%)     300
Total                           105(88%)   44(37%)    4(3%)      0(0%)

Taken together, the step difficulties of the Normal simulation data fit the model best. In decreasing order, there were 105, 44, 4, and 0 observations with optimal step difficulty for Normal, Logistic, Binomial, and Uniform, or 88%, 37%, 3%, and 0%. From these results we found: (1) The normal distribution had the most optimal step difficulties with the large or medium sample sizes, but did worse with the small sample size. (2) The logistic distribution differed from the normal distribution: its share of optimal step difficulties increased with the small sample size. (3) The binomial distribution reached 20% optimal step difficulties only with the small sample size and short form. (4) The uniform distribution had no optimal step difficulties; this distribution did not fit the Rasch measurement model.

Category Frequencies

The data of the simulation study were produced by setting parameters randomly, and the number of observations in each category was likewise produced randomly.
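The step-difficulty criterion behind Table 5 (Linacre's Guidelines 7 and 8) can be written as a small check on a list of step calibrations. The following Python sketch is illustrative only: the function names and the example thresholds are hypothetical, and treating a negative advance as disordering follows Guideline 5 rather than any procedure stated in the source. With six categories there are five step calibrations and therefore four advances per data set.

```python
def step_advances(thresholds):
    """Differences between successive step calibrations, in logits."""
    return [later - earlier for earlier, later in zip(thresholds, thresholds[1:])]


def classify_advance(advance, low=1.4, high=5.0):
    """Label one advance against Guidelines 5, 7 and 8 of Linacre (2002)."""
    if advance < 0:
        return "disordered"   # Guideline 5: step calibrations should advance
    if advance < low:
        return "too narrow"   # Guideline 7: advance by at least 1.4 logits
    if advance >= high:
        return "too wide"     # Guideline 8: advance by less than 5.0 logits
    return "optimal"


def count_optimal(thresholds):
    """Number of advances that satisfy both Guideline 7 and Guideline 8."""
    labels = [classify_advance(a) for a in step_advances(thresholds)]
    return labels.count("optimal")


# Hypothetical step calibrations for a six-category scale (five thresholds).
well_spaced = [-3.0, -1.5, 0.0, 1.5, 3.0]
problematic = [-1.0, -0.2, -0.5, 1.2, 3.0]

print(count_optimal(well_spaced))
print([classify_advance(a) for a in step_advances(problematic)])
```

Counting the "optimal" labels across replications gives frequencies of the kind reported in Table 5, under the assumption that each replication contributes four advances.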
According to the proper category guidelines of Linacre (2002), Guideline 1 requires at least ten observations in each category, which is the lower limit. A category with no observations is called an empty category; it can affect the validity and reliability of the test and lead to the combination of categories with disordered calibrations. First, we counted the categories with fewer than 10 observations under each distribution theory; the results are shown in Table 6.
Table 6 Summary of the number of categories with fewer than 10 observations

Sample size      Test length    Normal     Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    0(0%)      0(0%)      0(0%)      0(0%)     50000
large (N=1000)   short (I=10)   10(33%)    0(0%)      0(0%)      0(0%)     10000
medium (N=300)   long (I=50)    8(27%)     0(0%)      0(0%)      0(0%)     15000
medium (N=300)   short (I=10)   10(33%)    0(0%)      0(0%)      0(0%)     3000
small (N=30)     long (I=50)    10(33%)    1(3%)      0(0%)      0(0%)     1500
small (N=30)     short (I=10)   12(40%)    10(33%)    6(20%)     0(0%)     300
Total                           50(28%)    11(6%)     6(3%)      0(0%)

By distribution theory: (1) For Normal, 50 categories (28%) had fewer than 10 observations; only the large sample size with the long form satisfied Linacre's Guideline 1. (2) For Logistic, 11 categories (6%) had fewer than 10 observations, all with the small sample size. (3) For Binomial, 6 categories (3%) had fewer than 10 observations, all with the small sample size and short form. (4) For Uniform, no category had fewer than 10 observations. Therefore, for Normal, more than one-fourth of the categories failed Linacre's Guideline 1, which was the worst result. Logistic and Binomial failed Guideline 1 only with the small sample size and so did better than Normal. Only Uniform satisfied Guideline 1 completely.
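The counts in Table 6 rest on tallying how many responses fall in each category and flagging categories below the Guideline 1 minimum. A minimal Python sketch of that tally follows; the function names and the tiny example matrix are hypothetical, and the sketch is not the Winsteps procedure used in the study.

```python
from collections import Counter


def category_counts(responses, n_categories=6):
    """Count observations in each category 0..n_categories-1 over all persons and items."""
    counts = Counter(score for row in responses for score in row)
    return [counts.get(k, 0) for k in range(n_categories)]


def sparse_categories(counts, minimum=10):
    """Categories with fewer than `minimum` observations (Guideline 1)."""
    return [k for k, c in enumerate(counts) if c < minimum]


def missing_categories(counts):
    """Categories that were never observed (empty categories)."""
    return [k for k, c in enumerate(counts) if c == 0]


# Hypothetical 2-person x 3-item response matrix, scored 0..5.
example = [[0, 1, 2],
           [2, 3, 3]]
example_counts = category_counts(example)
print(example_counts)
print(sparse_categories(example_counts, minimum=2))
print(missing_categories(example_counts))
```

Applied to each simulated data set, `sparse_categories` would give the kind of count summarized in Table 6, and `missing_categories` the kind of count used for missing categories.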
Table 7 Summary of the frequency and percentage for missing category

Sample size      Test length    Normal    Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    0(0%)     0(0%)      0(0%)      0(0%)
large (N=1000)   short (I=10)   0(0%)     0(0%)      0(0%)      0(0%)
medium (N=300)   long (I=50)    0(0%)     0(0%)      0(0%)      0(0%)
medium (N=300)   short (I=10)   0(0%)     0(0%)      0(0%)      0(0%)     3000
small (N=30)     long (I=50)    3(10%)    0(0%)      0(0%)      0(0%)     1500
small (N=30)     short (I=10)   10(33%)   0(0%)      0(0%)      0(0%)     300
Total                           13(7%)    0(0%)      0(0%)      0(0%)

Next, we analyzed the number of missing categories in the estimation for each distribution. The results (Table 7) were:
(1) For Normal, 3 categories (10%) were missing with the small sample size and the long form, and 10 categories (33%) were missing with the small sample size and the short form.
(2) For Logistic, Binomial, and Uniform, no category was missing.
Therefore, in terms of sample size, when the number of responses was 1500 or fewer, the Normal distribution had missing categories and thus fewer categories, which was a defect; the other distributions had no missing categories.

The frequency of step calibration disorder

Linacre (2002) Guideline 3 states that average measures advance monotonically with category, and Guideline 5 states that step calibrations advance. Based on these guidelines, we judged the step calibration disorder of the simulated data for the four distributions. The results are in Table 8:
(1) Step calibration disorder did not occur for the Normal distribution; the Logistic and Binomial distributions each had only 1 step calibration disorder, and the Uniform distribution had 47.
(2) For the Logistic and Binomial distributions, step calibration disorder occurred with the small sample size (fewer than 3000 responses).
(3) For the Uniform distribution, step calibration disorder occurred under every combination of sample size and test length, with percentages ranging from 30% to 45%.
Therefore, in terms of step calibration disorder, the Normal distribution fit the Rasch model. For the Logistic and Binomial distributions, step calibration disorder occurred only with the small sample size, whereas the Uniform distribution did not fit the Rasch model.

Table 8 Summary of the frequency and percentage for step calibration disorder

Sample size      Test length    Normal   Logistic   Binomial   Uniform   Response
large (N=1000)   long (I=50)    0(0%)    0(0%)      0(0%)      6(30%)
large (N=1000)   short (I=10)   0(0%)    0(0%)      0(0%)      8(40%)
medium (N=300)   long (I=50)    0(0%)    0(0%)      0(0%)      8(40%)
medium (N=300)   short (I=10)   0(0%)    0(0%)      0(0%)      8(40%)    3000
small (N=30)     long (I=50)    0(0%)    1(5%)      0(0%)      9(45%)    1500
small (N=30)     short (I=10)   0(0%)    0(0%)      1(5%)      8(40%)    300
Total                           0(0%)    1(1%)      1(1%)      47(39%)

Discussion

This study simulated data from the normal, logistic, binomial, and uniform distributions and evaluated which theoretical distribution was best for the categories of the Rasch rating scale model, judged by the proper category criteria of Linacre (2002). First, regarding the category fit statistics, the categories fit the Rasch model for all four distributions, and the category estimates fit well, which supports inferences of measurement validity.
Second, we compared the standard errors of the category estimates for each distribution, and the results showed:
(1) The accuracy of category estimation was worse under the normal distribution and better under the logistic distribution.
(2) Sample size affected the standard error of category estimation: the larger the sample, the smaller the standard error.
(3) Test length also affected the standard error of category estimation: the longer the test, the smaller the standard error.
The third comparison concerned the step difficulties estimated from the simulated data for each distribution. The normal distribution showed the most ideal step difficulties with large and medium sample sizes, but its optimizing step difficulty decreased with a small sample size; the opposite held for the logistic distribution. Neither the binomial nor the uniform distribution fit the Rasch measurement model.
The fourth analysis concerned the number of missing categories. Categories were missing for the normal distribution when the number of responses was 1500 or fewer, so the number of categories decreased, which was an obvious defect. No category was missing for the other three distributions.
The fifth analysis concerned step calibration disorder. The results showed that the normal distribution fit the Rasch model, that step calibration disorder occurred for both the logistic and binomial distributions with the small sample size, and that the uniform distribution did not fit the Rasch model.
The results were:
(1) The normal distribution was the ideal distribution when the sample size was over 3000 in response data.
(2) The logistic distribution was the better distribution when the sample size was less than 1500 in response data.
The conclusion of the study was that the optimal distribution of rating scale categories in the Rasch model is related to sample size.

This study (NSC H ) was sponsored by the National Science Council, R.O.C. (Taiwan).

References

Andrich, D. (1996). Category Ordering and their Utility. Rasch Measurement Transactions, 9(4),
Bond, T. G., & Fox, C. M. (2007). Applying The Rasch Model: Fundamental Measurement in the Human Sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Chen, W. C. (2008). A construction of forehand driving test for first level athletes in Table Tennis. Unpublished doctoral dissertation, National Taiwan Sport University (Taoyuan), Graduate Institute of Sports Training Science, Taoyuan.
Chou, S. I., & Yau, H. D. (2006). An Evaluation of the Assessing the Development Level of the Standing Long Jump Observation Checklist. Paper presented at the 2nd Pacific Rim Objective Measurement Symposium, PROMS HK 2006, Hong Kong Institute of Education, Hong Kong, June.
Ho, R. G., & Yau, H. D. (1997). The relationship between Binomial, Normal and Poisson distribution in the psychomotor test. Psychological Testing, 44(1),
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Linacre, J. M. (1991). Step Disordering and Rasch Thurstone Thresholds. Rasch Measurement Transactions, 5(3), 171.
Linacre, J. M. (2001). Category, Step and Threshold: Definitions & Disordering.
Rasch Measurement Transactions, 15(1), 794.
Linacre, J. M. (2002). Understanding Rasch Measurement: Optimizing Rating Scale Category Effectiveness. The Journal of Applied Measurement, 3(1),
Linacre, J. M. (2010). Transitional Categories and Usefully Disordered Thresholds. Online Educational Research Journal, 1. Retrieved March 28, 2011, from
Myers, N. D., Feltz, D. L., & Wolfe, E. W. (2008). A Confirmatory Study of Rating Scale Category Effectiveness for the Coaching Efficacy Scale. Research Quarterly for Exercise and Sport, 79(3),
Pesudovs, K., Gothwal, V. K., Wright, T., & Lamoureux, E. L. (2010). Remediating serious flaws in the National Eye Institute Visual Function Questionnaire. Journal Cataract Refract Surg, 36,
Rasch, G. (1960/80). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research). Expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
Siegert, R. J., Jackson, D. M., Tennant, A., & Turner-Stokes, L. (2010). Factor analysis and Rasch analysis of the Zarit Burden Interview for acquired brain injury carer research. Journal Rehabilitation Medicine, 42,
Shaw, F., Wright, B., & Linacre, J. M. (1992). Disordered Steps? Rasch Measurement Transactions, 6(2), 225.
Stone, M. H., & Wright, B. D. (1994). Maximizing rating scale information. Rasch Measurement Transactions, 8, 386.
Tennant, A. (2004). Disordered Thresholds: An example from the Functional Independence Measure. Rasch Measurement Transactions, 17(4),
Van Lente, E., Karabatsos, G., & Uekawa, K. (2010). A bootstrap approach to rating scale optimization. In W. P. F. Fisher & M. Wilson (Eds.), Access to the Foundations of Measurement: Professional Identity in the Career of Benjamin D. Wright (forthcoming). Retrieved April 1, 2011, from
Yau, H. D. (2010). A study of reasonable categories setting approach in sport skill testing: An example of Multiple-Attempt Single-Item tests. Research Project Report of National Science Council (NSC H ).
Yau, H. D., Chi, S. C., Chou, S. I., & Yao, W. C. (2008). A revision of the assessing developmental level of the standing long jump. Journal of National Taiwan Sports University, 19(1), (NSC H )
Zhu, W., Updyke, W. F., & Lewandowski, C. (1997). Post-Hoc Rasch analysis of optimal categorization of an ordered-response scale. Journal of Outcome Measurement, 1(4),
Zhu, W., & Kang, S. J. (1998). Cross-cultural stability of the optimal categorization of a self-efficacy scale: A Rasch analysis. Measurement in Physical Education and Exercise Science, 2,
Zhu, W., Timm, G., & Ainsworth, B. (2001). Rasch calibration and optimal categorization of an instrument measuring women's exercise perseverance and barriers. Research Quarterly for Exercise and Sport, 72,
Zhu, W. (2002). A confirmatory study of Rasch-based optimal categorization of a rating scale. Journal of Applied Measurement, 3,
More informationReliability, validity, and all that jazz
Reliability, validity, and all that jazz Dylan Wiliam King s College London Introduction No measuring instrument is perfect. The most obvious problems relate to reliability. If we use a thermometer to
More informationDevelopment of the Mental, Emotional, and Bodily Toughness Inventory in Collegiate Athletes and Nonathletes
Journal of Athletic Training 2008;43(2):125 132 g by the National Athletic Trainers Association, Inc www.nata.org/jat original research Development of the Mental, Emotional, and Bodily Toughness Inventory
More informationProceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)
EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational
More informationLouis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison
Louis Leon Thurstone in Monte Carlo: Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science Rochester Institute
More informationTHE INTERPRETATION OF EFFECT SIZE IN PUBLISHED ARTICLES. Rink Hoekstra University of Groningen, The Netherlands
THE INTERPRETATION OF EFFECT SIZE IN PUBLISHED ARTICLES Rink University of Groningen, The Netherlands R.@rug.nl Significance testing has been criticized, among others, for encouraging researchers to focus
More informationChapter 2--Norms and Basic Statistics for Testing
Chapter 2--Norms and Basic Statistics for Testing Student: 1. Statistical procedures that summarize and describe a series of observations are called A. inferential statistics. B. descriptive statistics.
More informationValidity and Reliability of the Malaysian Creativity and Innovation Instrument (MyCrIn) using the Rasch Measurement Model
Validity and Reliability of the sian Creativity and Innovation Instrument (MyCrIn) using the Rasch Measurement Model SITI RAHAYAH ARIFFIN, FARHANA AHMAD KATRAN, AYESHA ABDULLAH NAJIEB BADIB & NUR AIDAH
More informationLow Tolerance Long Duration (LTLD) Stroke Demonstration Project
Low Tolerance Long Duration (LTLD) Stroke Demonstration Project Interim Summary Report October 25 Table of Contents 1. INTRODUCTION 3 1.1 Background.. 3 2. APPROACH 4 2.1 LTLD Stroke Demonstration Project
More informationPersonality Traits Effects on Job Satisfaction: The Role of Goal Commitment
Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan
More information(C) GoldCal LLC DBA GoldSRD
ROOT CAUSE ANALYSIS Danny M. Goldberg, Founder Root Cause Analysis Problem or Unwanted Event Occurrence Symptoms Problem or Unwanted Event Recurrence Apparent Cause Root Cause Prevent Wilson, Dell, and
More informationConceptualising computerized adaptive testing for measurement of latent variables associated with physical objects
Journal of Physics: Conference Series OPEN ACCESS Conceptualising computerized adaptive testing for measurement of latent variables associated with physical objects Recent citations - Adaptive Measurement
More informationRASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN
International Journal of Arts & Sciences, CD-ROM. ISSN: 1944-6934 :: 08(03):107 150 (2015) RASCH ANALYSIS OF SOME MMPI-2 SCALES IN A SAMPLE OF UNIVERSITY FRESHMEN Enrico Gori University of Udine, Italy
More informationMeasuring change in training programs: An empirical illustration
Psychology Science Quarterly, Volume 50, 2008 (3), pp. 433-447 Measuring change in training programs: An empirical illustration RENATO MICELI 1, MICHELE SETTANNI 1 & GIULIO VIDOTTO 2 Abstract The implementation
More informationO ver the years, researchers have been concerned about the possibility that selfreport
A Psychometric Investigation of the Marlowe Crowne Social Desirability Scale Using Rasch Measurement Hyunsoo Seol The author used Rasch measurement to examine the reliability and validity of 382 Korean
More informationThe Effect of Option Homogeneity in Multiple- Choice Items
Manuscripts The Effect of Option Homogeneity in Multiple- Choice Items Applied Psychological Measurement 1 12 Ó The Author(s) 2018 Reprints and permissions: sagepub.com/journalspermissions.nav DOI: 10.1177/0146621618770803
More informationCenter for Advanced Studies in Measurement and Assessment. CASMA Research Report
Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations
More informationA Brief (very brief) Overview of Biostatistics. Jody Kreiman, PhD Bureau of Glottal Affairs
A Brief (very brief) Overview of Biostatistics Jody Kreiman, PhD Bureau of Glottal Affairs What We ll Cover Fundamentals of measurement Parametric versus nonparametric tests Descriptive versus inferential
More informationCHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology
CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY 7.1 Introduction This chapter addresses the research design and describes the research methodology employed in this study. The sample and sampling procedure is
More information