MULTIPLE-CHOICE ITEMS ANALYSIS USING CLASSICAL TEST THEORY AND RASCH MEASUREMENT MODEL


Man In India, 96 (1-2). Serials Publications.

Adibah Binti Abd Latif*, Ibnatul Jalilah Yusof, Nor Fadila Mohd Amin, Wilfredo Herrera Libunao and Siti Sarah Yusri
Faculty of Education, Universiti Teknologi Malaysia, Malaysia
* p-adibah@utm.my

The purpose of this study is to analyze item difficulty and person ability using two measurement frameworks, Classical Test Theory (CTT) and the Rasch Measurement Model (RMM). A total of 100 undergraduate students from the Faculty of Education responded to a final examination paper in Research Methodology consisting of 60 multiple-choice questions (MCQ). The Cronbach's alpha (CTT) obtained is 0.62, the Person Reliability (RMM) is 0.59, and the Item Reliability is 0.95. This study found a slight difference between the item difficulty levels and person abilities obtained from CTT and RMM. However, there is no significant difference (p > .05) between the item difficulty index (CTT) and the item measure (RMM), and likewise no significant difference (p > .05) between the person ability estimates from CTT and RMM. Although RMM is theoretically considered the superior measurement framework over CTT, this study found that item and person statistics appear similar across the two frameworks. Interpretations beyond the underlying philosophies are therefore discussed.

Keywords: Item analysis, Classical Test Theory, Rasch Measurement Model

Introduction

Musial et al. (2009) defined assessment as the art of placing learners in a setting that clarifies what learners experience and can do, as well as what they may not recognize or cannot perform. Assessment provides a picture of a student's advancement and achievements. The data obtained from an assessment are used as part of high-stakes decision making: placement decisions such as choosing a program of study, promotion decisions such as tracking learning progress, and determining whether students obtain certificates or other qualifications that empower them to achieve their objectives (Riley & Cantu, 2000; Braun et al., 2006).

The Malaysian educational system is now presented with the challenge of developing appropriate and meaningful ways to evaluate the extent to which students are meeting the standards. Tests and examinations can accurately or inaccurately reflect the current level of students' learning. A test can, however, be studied from different angles, and its items can be evaluated according to different theories or models that provide a better perspective on the relationship that may exist between the observed score on an examination and the underlying, generally unobserved, capability in the domain (Champlain, 2010). Two main test theory models that have been proposed for creating and evaluating test items are Classical Test Theory (CTT) and Item Response Theory (IRT).

These two theories are currently popular measurement frameworks for addressing measurement problems such as test-score equating, test development, and the identification of biased items (Hambleton & Jones, 1993; Lawson, 2006). To date, many educators in Malaysia still use the CTT approach in analyzing test items. Theoretically, CTT is simple and easy to apply: its straightforward and weak theoretical assumptions, easily met by test data, make it extensively used in item analysis (Hambleton & Jones, 1993; Champlain, 2010). However, many researchers have begun questioning its utility in the modern era (Amir et al., 2008). CTT has the limitation of circular dependency in estimating the test item parameters, namely item difficulty and item discrimination (Fan, 1998; Adedoyin & Adedoyin, 2013; Lawson, 2006; Stage, 2003). Circular dependency means, for example, that an easy test can overestimate the ability estimates of students while a difficult test can do the reverse, underestimating the abilities of examinees (Fan, 1998; Amir et al., 2008). An individual will appear to be of low ability when the test is difficult, yet appear to be of high ability when the test is easy; it is thus difficult to compare the relative abilities of students taking two different tests (McAlpine, 2002). CTT also considers students who gain the same total marks to have the same ability, regardless of whether they answered easy or difficult items. This affects the interpretation of students' grading, ranking, and reporting. In contrast to CTT, IRT generates a rank ordering of students on the underlying trait rather than on the test scores; students should be placed in the correct rank order regardless of which items they chose to answer (McAlpine, 2002). Consequently, IRT has witnessed exponential growth in recent decades as it is used to overcome the limitations of CTT (Neşe et al., 2013). Thus, this paper compares item analysis under both approaches: CTT and IRT's Rasch Measurement Model (RMM). There are four objectives in this study:

(i) To investigate the level of item difficulty using the CTT and RMM approaches.
(ii) To analyze the statistical significance of the difference between item difficulties obtained from CTT and RMM.
(iii) To investigate the level of students' ability using the CTT and RMM approaches.
(iv) To analyze the significance of the difference between students' abilities obtained from CTT and RMM.

Classical Test Theory

CTT introduces three concepts: test score, true score, and error score (Hambleton & Jones, 1993; Kline, 2005). The test score is often identified as the observed score, while the true score and error score are unobserved, or latent. An individual's test score (X) is proposed to consist of a true score (T) and an error score (E), as depicted in the equation below:

X = T + E

From this formula, it can be concluded that an individual's test score is influenced by the true score and the error score. The true score is the expected score obtained by taking the mean of the scores an individual would get across equivalent or parallel forms (Hambleton & Jones, 1993; Kline, 2005), while Harvill (1991) describes the true score as an individual's score uninfluenced by any random events. The true score, according to Miller et al. (2011), can never be known; it is simply the expected score an individual would obtain across parallel forms (Hambleton & Jones, 1993). By the definition of Gronlund and Linn (1990), parallel forms are tests administered to the same group of individuals in close succession, whose scores are then correlated, while Hambleton and Jones (1993) suggest that parallel forms are tests measuring the same content for which the true score and the size of the error score of all students are equal. The error score, also known as the error of measurement, is the difference between the obtained score and the true score. It is random in nature: unsystematic, due to chance, and driven by uncontrolled and unspecified factors that influence an individual's test score (Harvill, 1991; Miller et al., 2011). An individual's score could therefore be high or low because of the error score. Over an infinite number of testings, the error score will increase and decrease an individual's score by exactly the same amount because of its random character (Miller et al., 2011).

One-Parameter Logistic Model (Rasch Measurement Model)

There are three widely used IRT models, the One-Parameter Logistic Model (1-PL), the Two-Parameter Logistic Model (2-PL) and the Three-Parameter Logistic Model (3-PL), each with its own parameters. One of the key components that distinguishes these models is the Item Characteristic Curve (ICC), which graphically displays the information of each item generated by IRT (Kline, 2005; Gleason, 2008). The One-Parameter Logistic Model (1-PL), also known as the Rasch Model (Gleason, 2008; Kline, 2005; Adedoyin & Adedoyin, 2013), is the most basic model in IRT, estimating only one parameter, the difficulty parameter (b) (Kline, 2005). In the 1-PL, the level of item discrimination (a) and the guessing probability (c) are assumed to be constant (Magno, 2009). In the 1-PL model, the ICC for each item is given by the equation below:

P_i(θ) = e^(θ - b_i) / (1 + e^(θ - b_i))

where P_i(θ) represents the probability that a student with ability θ responds to the i-th item correctly, and b_i is the difficulty of the i-th item. The b_i value typically ranges from -2 to 2 but can take more extreme values (Sick, 2008). As noted in Kline (2005), b and θ are scaled using a normal distribution with a standard deviation of 1.0 and a mean of 0.0; hence Magno (2009) draws two summaries from this equation: (i) the easier the item, the higher the probability that students will answer it correctly; (ii) students with high ability are more likely to answer items correctly than students with less ability.
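To make the model concrete, here is a minimal Python sketch of the 1-PL equation above; the function name and the example ability and difficulty values are illustrative choices, not taken from the paper.

import math

def rasch_probability(theta, b):
    # 1-PL (Rasch) item characteristic curve: probability of a correct
    # response given person ability theta and item difficulty b, in logits.
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# An average-ability student (theta = 0) on an easy item (b = -1)
# versus a hard item (b = +1):
print(round(rasch_probability(0.0, -1.0), 3))  # 0.731
print(round(rasch_probability(0.0, 1.0), 3))   # 0.269

The two printed values illustrate Magno's (2009) summaries: the probability of success rises as items become easier relative to the student's ability.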

Materials and Methods

This study used a quantitative survey research and item analysis approach. The population was the undergraduate students of Semester II (2013/2014) of the Faculty of Education in one of the public universities in Malaysia, numbering 520 students. A hundred students who took the Research Methodology paper, which consists of 60 multiple-choice questions, were purposively taken as the sample for this study.

Item Difficulty

Items were analyzed to see which were more difficult than others based on the value of the item difficulty index (p). Mitra et al. (2009) suggest that an item is considered difficult if its p-value is less than 0.3 and easy if its p-value is more than 0.7. For item difficulty under CTT, the index was calculated as the total number of correct responses divided by the total number of responses. Item difficulty under the Rasch model was analyzed using Winsteps, which produced the item map and item measures; the estimates of ability and difficulty calculated from this analysis are referred to as logits or measures (Ludlow & Haley, 1995).

Person Ability

This study also investigates the differences in students' ability under CTT and IRT. Under CTT, a student's ability is based on the total score obtained, regardless of the difficulty of the items: students with higher scores are regarded as high-ability students, and students with lower scores as low-ability students. Under IRT, a student's ability depends on the difficulty of the items answered: students who answer more difficult items correctly are considered of higher ability than students who answer the same items wrongly. The significance of the differences in item difficulty and students' ability was tested using t-test analysis.
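As a rough illustration of the CTT computations just described, the sketch below derives item difficulty indices and total person scores from a small, invented 0/1 response matrix, with cut-offs following Mitra et al. (2009); the Rasch measures themselves came from Winsteps and are not reproduced here. All data and names in the sketch are hypothetical.

def item_difficulty(responses):
    # p for each item: number of correct responses / number of respondents.
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students
            for i in range(n_items)]

def classify(p):
    # Mitra et al. (2009): difficult below 0.3, easy above 0.7.
    if p < 0.3:
        return "difficult"
    if p > 0.7:
        return "easy"
    return "moderate"

# Toy data: 4 students x 3 items (the study itself used 100 x 60).
responses = [[1, 0, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 1]]

for i, p in enumerate(item_difficulty(responses), start=1):
    print(f"Q{i}: p = {p:.2f} ({classify(p)})")

# Under CTT, person ability is simply the total score per row:
print([sum(row) for row in responses])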

Results

Person Reliability under RMM was 0.56, indicating low consistency, while Item Reliability was 0.95, indicating a wide range of item measures or an adequate sample. Prior to item analysis using RMM, the examination paper was checked against the assumption of unidimensionality. Table 1.0 shows that the raw variance explained by measures was below 40%, the minimum accepted value for using RMM (Azrilah et al., 2013). The unexplained variance in the first contrast, at 5.6%, was a good value; it should not exceed 15%, which would indicate too much noise (Azrilah et al., 2013). Thus, the dimensionality results showed that the examination paper needs to be revised, especially in the weighting of the important content asked in the questions.

Level of Item Difficulty

Table 2.0 and Table 3.0 show the details of the classification of difficulty levels under RMM and CTT respectively. Items Q17, Q49 and Q60 were at a moderate difficulty level according to the Rasch analysis, whereas under CTT these items were at a high difficulty level. Items Q2, Q27 and Q58 were at a moderate level under RMM but at a low level under CTT. The CTT and RMM item difficulty values were standardized by transforming them to z-scores, and the comparison was analyzed using a t-test. The result shows there was no significant difference [t(59), p > .05] between the item difficulty indices from the CTT approach and the RMM approach.

Person Ability

As can be seen from Table 4.0 and Table 5.0, no student was placed in the high-ability category. Students S45, S44, S1, S36, S37, S40, S49, and S8 were placed in the moderately high ability category under both RMM and CTT. Students S39, S12, S19, S35, S61, S88, S18, S21, S38, S48, S10, S22, S28, S34, S46, S69, S9, S24, S42, S57, S11, S20, S4, S47, S55, S59, S6, S62, S81, S89, S93 and S96 were moderately high under RMM but moderately low under CTT. Under RMM, students S2, S23, S33, S41, S50, S64, S7, S70, S74, S83, S84, S100, S14, S16, S26, S27, S3, S30, S56, S58, S66, S72, S73, S77, S79, S87, S94, S95, S31, S51, S78, S86, S92, S13, S15, S29, S43, S54, S63, S71, S80, S85, S90, S97, S98, S99, S17, S5, S65, S91, S25, S32, S52, S75, S76, S82, S53, S60, S67 and S68 were placed in the moderately low category, while under CTT they were placed in the low-ability category. The CTT marks and the RMM person measures were likewise standardized to z-scores and compared using a t-test. The finding shows there was no significant difference [t(99), p > .05] in person ability between the CTT and RMM analyses.
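In outline, the comparison step works as follows. The reported t(59) and t(99) are consistent with paired t-tests over the 60 items and 100 persons, so this sketch assumes a paired design; SciPy is my choice of tool here, not software named by the authors, and the five item values are invented.

import numpy as np
from scipy import stats

def z_scores(x):
    # Standardize a set of values to mean 0 and standard deviation 1.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical values for five items: CTT difficulty indices (higher = easier)
# and Rasch difficulty measures in logits (higher = harder).
ctt_p = [0.25, 0.40, 0.55, 0.70, 0.85]
rasch_b = [1.2, 0.5, 0.0, -0.6, -1.4]

t, p_value = stats.ttest_rel(z_scores(ctt_p), z_scores(rasch_b))
# Note: standardizing both series to mean 0 equates their means by
# construction, so a mean-difference test like this will not reject.
print(f"t({len(ctt_p) - 1}) = {t:.3f}, p = {p_value:.3f}")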

TABLE 1: UNIDIMENSIONALITY

Assumption of Unidimensionality: Percentage (%)
Raw variance explained (empirical): 21.6
Raw variance explained (model): 21.3
Unexplained variance (1st contrast): 5.6

TABLE 2: CLASSIFICATION OF ITEM DIFFICULTY LEVEL, SUBJECT B (RMM)

Level of Difficulty: Items
High (above logit 0.82): Q32, Q1, Q10, Q50, Q34, Q31, Q46, Q14, Q28, Q11
Moderately High (logit 0.82 to 0.00): Q17, Q49, Q60, Q30, Q40, Q45, Q5, Q29, Q36, Q44, Q47, Q57, Q18, Q39, Q41, Q25, Q42, Q43, Q52
Moderately Low (logit 0.00 to -1.18): Q26, Q3, Q37, Q15, Q24, Q53, Q55, Q7, Q38, Q56, Q16, Q20, Q22, Q6, Q19, Q48, Q21, Q33, Q13, Q4, Q8
Low (below logit -1.18): Q51, Q59, Q9, Q12, Q35, Q54, Q23, Q2, Q27, Q58

TABLE 3: CLASSIFICATION OF ITEM DIFFICULTY LEVEL, SUBJECT B (CTT)

Level of Difficulty: Items
High (p ≤ 0.30): Q32, Q1, Q10, Q50, Q34, Q31, Q46, Q14, Q28, Q11, Q17, Q60, Q49
Moderate (0.31 ≤ p ≤ 0.79): Q30, Q40, Q45, Q5, Q29, Q36, Q44, Q47, Q57, Q18, Q39, Q41, Q25, Q42, Q43, Q52, Q26, Q3, Q37, Q15, Q24, Q53, Q55, Q7, Q38, Q56, Q16, Q20, Q22, Q6, Q19, Q48, Q21, Q33, Q13, Q4, Q8, Q51, Q59, Q9, Q12, Q35, Q54, Q23
Low (p ≥ 0.80): Q2, Q27, Q58

TABLE 4: CLASSIFICATION OF PERSON ABILITY FOR SUBJECT B (RMM)

Level of Person Ability: Person
Moderately High (logit 0.82 to -0.37): S45, S44, S1, S36, S37, S40, S49, S8, S39, S12, S19, S35, S61, S88, S18, S21, S38, S48, S10, S22, S28, S34, S46, S69, S9, S24, S42, S57, S11, S20, S4, S47, S55, S59, S6, S62, S81, S89, S93, S96
Moderately Low (logit -0.37 to -1.18): S2, S23, S33, S41, S50, S64, S7, S70, S74, S83, S84, S100, S14, S16, S26, S27, S3, S30, S56, S58, S66, S72, S73, S77, S79, S87, S94, S95, S31, S51, S78, S86, S92, S13, S15, S29, S43, S54, S63, S71, S80, S85, S90, S97, S98, S99, S17, S5, S65, S91, S25, S32, S52, S75, S76, S82, S53, S60, S67, S68

TABLE 5: CLASSIFICATION OF PERSON ABILITY FOR SUBJECT B (CTT)

Level of Person Ability: Person
Moderately High (Marks: 74 to 60; Grade Point: 3.33 to 2.67): S45, S44, S1, S36, S37, S40, S49, S8
Moderately Low (Marks: 59 to 45; Grade Point: ): S39, S12, S19, S35, S61, S88, S18, S21, S38, S48, S10, S22, S28, S34, S46, S69, S9, S24, S42, S57, S11, S20, S4, S47, S55, S59, S6, S62, S81, S89, S93, S96
Low (Marks: 44 to 00; Grade: ): S2, S23, S33, S41, S50, S64, S7, S70, S74, S83, S84, S100, S14, S16, S26, S27, S3, S30, S56, S58, S66, S72, S73, S77, S79, S87, S94, S95, S31, S51, S78, S86, S92, S13, S15, S29, S43, S54, S63, S71, S80, S85, S90, S97, S98, S99, S17, S5, S65, S91, S25, S32, S52, S75, S76, S82, S53, S60, S67, S68

Discussion

The findings show there were no significant differences in item difficulty and students' ability between RMM and CTT. Research by Idowu et al. (2011) likewise indicated that item statistics derived from the two measurement frameworks are quite comparable and appear similar for CTT and IRT. However, when categorizing item difficulty and students' ability by cut-off score, some items fell under different difficulty levels and some persons were categorized under different ability levels. These findings are supported by Dibu (2013), who found that person statistics derived by CTT and IRT produce similar results. Amir et al. (2008) also found that analyses of the ability level of individual examinees lead to similar results across the different measurement theories. Fan (1998) examined the behavior of item and person statistics under IRT and CTT and showed that there was little difference between the item and person statistics from CTT and the 1-PL, 2-PL and 3-PL models. The similarity of these findings shows that, when the total score is used, the probability of ranking students at the same level is high under both CTT and IRT. This is because both IRT and CTT begin from the total score without considering the students' patterns and processes in answering the questions; hence the possibility of ranking them at the same ability is high. For example, if two different students obtain the same marks in an exam, say 80 marks, both CTT and IRT will place them at the same ability based on their raw scores. In CTT, the interpretation of this achievement ends there, but not in IRT. In IRT, the interpretation of students' answers is based on their responses to easy and difficult items: two students with the same marks will be interpreted as having different abilities if one of them scores more on easier items while the other scores more on difficult items.
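A toy sketch makes this concrete: two invented response patterns with identical raw scores over items ordered from easiest to hardest. The item labels are borrowed from Tables 2 and 3 (Q2, Q27 and Q58 were among the easiest items, Q17, Q49 and Q60 among the hardest); the response patterns themselves are hypothetical.

# Items ordered from easiest to hardest (labels from Tables 2 and 3).
items = ["Q2", "Q27", "Q58", "Q17", "Q49", "Q60"]

student_a = [1, 1, 1, 0, 0, 0]  # Guttman-consistent: passes easy, fails hard
student_b = [1, 0, 0, 0, 1, 1]  # same raw score, succeeds on hard items only

assert sum(student_a) == sum(student_b)  # identical CTT total score of 3

for name, pattern in (("A", student_a), ("B", student_b)):
    print(f"Student {name}: {pattern} raw score = {sum(pattern)}")

# CTT treats the two students as identical; the scalogram reading discussed
# below distinguishes them by which items they answered correctly.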

The student who answers more difficult items correctly will be classified as the student with higher ability. In IRT, scalogram analysis using the Guttman scale is the best way to differentiate students according to their ability to answer difficult items. For example, if a student with high marks answers more of the difficult items correctly, this shows a positive direction; but if a student obtains high marks by answering more of the easy items correctly while getting many difficult items wrong, the direction is negative and the ability will be considered lower than that of the previous type of student. From the scalogram, the pattern of students' answers can be predicted. For example, we can predict whether a student made lucky guesses in answering some items correctly or really has the knowledge to answer them. Predictions can also be made when students leave items unanswered: IRT can help determine whether a student really does not know the answer, did not have enough time to answer the item, or intentionally left the item unanswered. By analyzing all of these patterns through the Guttman scale in the scalogram, fair judgment of students' performance and accurate decision making can be achieved.

IRT is theoretically considered the superior measurement framework over CTT. Although this study found no significant differences in item and person statistics between these two measurement frameworks, interpretation using IRT gives richer information for judging students' achievement.

Acknowledgment

This research was funded by the Ministry of Education and the Research Management Centre, UTM, through the Fundamental Research Grant Scheme, Vote Number 4f381.

References

Adedoyin, O. O., & Adedoyin, J. A. (2013). Assessing the comparability between classical test theory (CTT) and item response theory (IRT) models in estimating test item parameters. Herald Journal of Education and General Studies, 2.

Amir, Z., Atiq-Ur-Rehman, K., Mamoon, M., & Arshad, A. (2008). Students' ranking based on their abilities on objective type tests: Comparison of CTT and IRT. Proceedings of the EDU-COM 2008 International Conference.

Azrilah, A. A., Saidfudin, M. M., & Azami, Z. (2013). Asas Model Pengukuran Rasch: Pembentukan Skala & Struktur Pengukuran. Malaysia: Penerbit Universiti Kebangsaan Malaysia.

Braun, H., Kanjee, A., Bettinger, E., & Kremer, M. (2006). Improving Education Through Assessment, Innovation, and Evaluation. Cambridge: American Academy of Arts and Sciences.

Champlain, A. F. (2010). A primer on classical test theory and item response theory for assessments in medical education. Medical Education, 44(1).

Dibu, O. O. (2013). Classical Test Theory (CTT) vs Item Response Theory (IRT): An evaluation of the comparability of item analysis results. Lecture presentation, Abuja, Nigeria (May 23).

Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement.

Gleason, J. (2008). An evaluation of mathematics competitions using item response theory. Notices of the AMS, 55(1).

Gronlund, N. E., & Linn, R. L. (1990). Measurement and Evaluation in Teaching (6th ed.). New York: Macmillan Publishing Company.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice.

Hambleton, R. K., & Swaminathan, H. (1995). Item Response Theory: Principles and Applications. Norwell: Kluwer Academic Publishers.

Harvill, L. M. (1991). Standard error of measurement. Educational Measurement: Issues and Practice, 10.

Idowu, E. O., Eluwa, A. N., & Abang, B. K. (2011). Evaluation of mathematics achievement test: A comparison between classical test theory (CTT) and item response theory (IRT). Journal of Educational and Social Research, 1(4).

Kline, T. (2005). Psychological Testing: A Practical Approach to Design and Evaluation. Thousand Oaks: Sage Publications.

Lawson, D. M. (2006). Applying the item response theory to classroom examinations. Journal of Manipulative and Physiological Therapeutics.

Ludlow, L. H., & Haley, S. M. (1995). Rasch model logits: Interpretation, use, and transformation. Educational and Psychological Measurement, 55.

Magno, C. (2009). Demonstrating the difference between classical test theory and item response theory using derived test data. The International Journal of Educational and Psychological Assessment, 1(1).

McAlpine, M. (2002). A Summary of Methods of Item Analysis. University of Glasgow: Robert Clark Centre for Technological Education.

Miller, L. A., McIntire, S. A., & Lovler, R. L. (2011). Foundations of Psychological Testing: A Practical Approach (3rd ed.). Thousand Oaks: Sage Publications.

Mitra, N. K., Nagaraja, H. S., Ponnudurai, G., & Judson, J. P. (2009). The levels of difficulty and discrimination indices in type A multiple choice questions of pre-clinical Semester 1 multidisciplinary summative tests. IeJSME, 3(1), 2-7.

Musial, D., Nieminen, G., Thomas, J., & Burke, K. (2009). Foundations of Meaningful Educational Assessment. New York: McGraw-Hill.

Neşe, G., Gülden, K. U., & Gülşen, T. T. (2013). Comparison of classical test theory and item response theory in terms of item parameters. European Journal of Research on Education, 2(1), 1-6.

Riley, R., & Cantu, N. (2000). The Use of Tests as Part of High-Stakes Decision-Making for Students: A Resource Guide for Educators and Policy Makers. Washington, DC: U.S. Department of Education, Office of Civil Rights.

Stage, C. (2003). Classical test theory or item response theory: The Swedish experience. Centro de Estudios Públicos, 42.

