Parallel Forms for Diagnostic Purpose

Paper presented at AERA, May 2010
Fang Chen and Xinrui Wang
UNCG, USA

INTRODUCTION

With the advancement of validity discussions, the measurement field is extending research from the initial stage of test development to the final stage of score interpretation and use. Strictly speaking, a test designed for a specific purpose cannot be used for other, unintended purposes. However, a test is expected to fulfill so many goals for different audiences that compromises are frequent in practice. One strong argument in favor of using one test for more than one purpose is to minimize interference with, and time taken away from, classroom teaching. This is especially true for achievement tests in many parts of the world. With these considerations in mind, how to maximize the information obtained from a single test is a topic of interest for many measurement researchers as well as practitioners. One approach is to analyze the data from different perspectives for different purposes: for example, using Item Response Theory (IRT) to find the best cut score for selection purposes, and using Cognitive Diagnostic Models (CDMs) to identify profiles of skill mastery or non-mastery for diagnostic, placement, or program evaluation purposes.

Another issue of interest to the measurement field is parallel test forms. Generating parallel test forms is important when measuring achievement: it helps with test security, and it enables multiple testing windows to ensure fairness and the best performance of every test taker. This matters even more when the same test result is used to allocate limited further educational opportunities. However, generating parallel forms is also a challenging task, as tests have to strike a balance between content and measurement specifications at the same time (Gibson & Weiner, 1998). Under classical test theory (CTT), item difficulty and item discrimination are used to judge whether test forms are parallel. With modern test theories (models) such as IRT, one selects items from pre-calibrated banks according to the test information function under constraints; tests do not need to be parallel in terms of item or test difficulty, although comparable content coverage is still regarded as desirable. However, while IRT has become the norm for modern testing, it cannot provide the more refined information that can benefit teachers and students for diagnostic and teaching purposes. CDMs were developed for this purpose. CDMs can produce a detailed analysis of a person's ability profile and help maximize the information obtained from a test beyond what IRT can provide. However, for this emerging class of models, the question of whether test forms are parallel has not been covered as much in the literature. For this reason, it is interesting to explore procedures for evaluating the parallelism of test forms from a cognitive diagnostic perspective.

A detailed introduction to CDMs is beyond the scope of this paper; interested readers can refer to Leighton and Gierl (2007) and Rupp, Templin, and Henson (2010). Chinese readers can also refer to a non-technical introduction by Chen (2011). CDMs involve latent classes: we regard the correct or incorrect answer to an item as a manifestation of a group of latent attributes working together for that item. Different attribute patterns lead to different probabilities of correct or incorrect responses, and these patterns define the latent classes.
Within latent classes is the idea of conditional independence: the probabilities of responses are independent of one another given an examinee's latent class membership (Rupp et al., 2010). CDMs include a whole family of sub-models, each with its own particular constraints. One model in particular, the noncompensatory reduced reparameterized unified model (NCRUM), assumes that the probability of a correct response decreases as the number of mastered attributes decreases (Rupp et al., 2010). Put simply, the NCRUM requires that a person master simpler attributes before more complex attributes. A model such as this may be especially useful in achievement testing, where there are several content areas that range from rudimentary to more complex, that is, where hierarchical learning is expected (Bloom, 1956).

Benjamin S. Bloom developed the taxonomy of learning that results from instruction, known as the Taxonomy of Educational Objectives, as an easier approach to developing examinations (Krathwohl, 2002). The taxonomy is used as a tool to measure learning in six main cognitive domains: knowledge, comprehension, application, analysis, synthesis, and evaluation (Krathwohl, 2002). Furthermore, the taxonomy is arranged as a hierarchy; in order to progress to the next level of thinking, one must possess the skills of the levels that precede it (Bloom, 1956). For example, one cannot progress to comprehension without having acquired the skills of knowledge. Although Bloom's taxonomy has been extended and developed into numerous new frameworks, the central concept remains: the cognitive skills are hierarchical. Is this assumption supported by real achievement test data? If so, is the cognitive diagnostic information for test takers based on traditionally defined parallel forms also consistent from a CDM perspective? How should parallelism be evaluated if the test purpose is diagnostic? We decided to explore the parallelism of test forms in terms of cognitive assessment and diagnosis, which is closely related to the proposal of maximizing information from achievement tests through CDMs. We demonstrate the considerations needed to evaluate parallelism for diagnostic purposes and explore indices that can help with the judgment.

METHOD

Data

We used data from the 2007 Trends in International Mathematics and Science Study (TIMSS). TIMSS is an international program designed to improve students' mathematics and science skills (Olson, Martin, & Mullis, 2009). TIMSS measures trends in mathematics and science every four years at the fourth and eighth grade levels in fifty-nine countries (Olson et al., 2009). The sample selection in TIMSS 2007 follows a systematic, two-stage probability proportional-to-size (PPS) sampling technique, in which schools are first selected and then classes within sampled (and participating) schools. This sampling method is a natural match for the hierarchical nature of the population, where classes of students are nested within schools. The schools are sampled to reflect the variety of school types, and classes within schools are sampled to reflect the diversity among classes. For our study, we used the United States student sample, which contained a total of 545 students, of whom 50.6% were girls and 49.4% were boys. We also decided to focus on the mathematics test for exploratory purposes, because its cognitive hierarchy is easier to define. For the mathematics test, students are given a booklet of questions to measure achievement (Olson et al., 2009). Each booklet is divided into two blocks (Block 1 and Block 2). While each student takes two different blocks, each block is shared between two groups of test takers for linking purposes. The test questions are divided into three cognitive domains, Knowing, Applying, and Reasoning, similar to those of Bloom's taxonomy. We can therefore examine whether the two blocks of the TIMSS mathematics achievement test are parallel from a diagnostic perspective, that is, whether they give similar estimates of students' ability profiles according to the test blueprint. The NCRUM was chosen based on our theory that the three cognitive domains are hierarchical in nature.
Put simply, if Reasoning is the skill required to respond to a question correctly, knowing the concept and being able to apply the knowledge are not enough to ensure a correct answer. The chosen mathematics test is divided into two blocks with 13 and 16 questions, respectively. The design of the blocks makes it clear that they are assumed to be parallel in terms of structure, content, and quality; that is, they can be exchanged with each other and provide reliable score interpretations for any group of test takers. The cognitive skills are the focus of this paper, and the coverage of the skills measured by the blocks was defined by the test development team and is summarized in Table 1.

Table 1. Distribution of Cognitive Skills

            Block 1    Block 2
Knowing        3          6
Applying       9          6
Reasoning      1          4
Total         13         16

There are three relevant research questions:

1. How well can the TIMSS items discriminate between students with high and low cognitive abilities?
2. Can the two test forms (blocks) give consistent and reliable classifications of students in terms of cognitive abilities?
3. Do student responses reflect a hierarchy of the three cognitive skills? In other words, is the assumed hierarchy of Bloom's taxonomy supported by the data?

Model

A cognitive diagnostic model (also called a diagnostic classification model, DCM), the NCRUM, was chosen for several reasons. First, we wanted to see more detailed information than just a total score. A regular unidimensional item response theory (IRT) model is enough to provide overall item quality and person ability estimates, but it does not differentiate between test takers on each cognitive skill required by an item; a DCM provides this type of information. When combined with analyses of mathematics attributes such as Algebra and Number, as classified by TIMSS, users will be able to explain differences in student responses in terms of both mathematics ability and cognitive ability. As a subtype of DCMs, the NCRUM assumes that the attributes measured by an item cannot compensate for the lack of other attributes required by the item (Rupp et al., 2010). As previously mentioned, this is consistent with Bloom's taxonomy and justifies our choice of this particular model within the DCM family.

A Q-matrix relevant to our research purpose was retrofitted to the data for analysis. To create the Q-matrix, we re-specified the cognitive skills to match Bloom's taxonomy. Thus, an item intended to measure Knowing was coded (1,0,0), an item measuring Applying was coded (1,1,0), and an item measuring Reasoning was coded (1,1,1), so that a correct response to an Applying item, for example, signals that the student has reached level 2 of the cognitive hierarchy. We used the program RUM, written by Dr. Robert Henson, for this analysis. It was an easy tool to implement, and it provided attribute-level item parameters π* and r* (Rupp et al., 2010). These parameters were then used to calculate attribute-level discrimination parameters for the purpose of item evaluation.
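
To make the Q-matrix coding and the NCRUM assumption concrete, the sketch below (Python) maps cognitive-domain labels to Q-matrix rows and evaluates the reduced RUM item response function in the parameterization described by Rupp et al. (2010). The π* and r* values and the example item are invented for illustration only; they are not the parameters estimated from the TIMSS data.

```python
import numpy as np

# Q-matrix rows implied by the Bloom-style hierarchy: Knowing -> (1,0,0),
# Applying -> (1,1,0), Reasoning -> (1,1,1).
Q_ROWS = {"Knowing": [1, 0, 0], "Applying": [1, 1, 0], "Reasoning": [1, 1, 1]}

def build_q_matrix(domains):
    """Map each item's cognitive-domain label to its Q-matrix row."""
    return np.array([Q_ROWS[d] for d in domains])

def ncrum_prob(alpha, q_row, pi_star, r_star):
    """Reduced (noncompensatory) RUM item response function:
    P(X = 1 | alpha) = pi_star * prod_a r_star[a] ** (q[a] * (1 - alpha[a])).
    Each required-but-unmastered attribute multiplies the baseline
    probability pi_star by a penalty r_star[a] < 1."""
    alpha, q_row, r_star = map(np.asarray, (alpha, q_row, r_star))
    return pi_star * np.prod(r_star ** (q_row * (1 - alpha)))

# Illustrative values only: an "Applying" item with baseline .85 and penalties .4 and .5.
q = build_q_matrix(["Applying"])[0]
for profile in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    p = ncrum_prob(profile, q, pi_star=0.85, r_star=[0.4, 0.5, 1.0])
    print(profile, round(float(p), 3))
```

The printout illustrates the noncompensatory pattern the paper relies on: the correct-response probability rises only as the required attributes are mastered in order, and mastering Reasoning adds nothing for an item that does not require it.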

Analysis procedures

We calculated attribute-level discrimination parameters using Equation 1, following the notation in Rupp et al. (2010):

d_{ia} = \pi_i^{*} - \pi_i^{*} r_{ia}^{*} = \pi_i^{*}\left(1 - r_{ia}^{*}\right), \quad a = 1, \ldots, A \text{ with } q_{ia} = 1 \qquad \text{(Equation 1)}

Here \pi_i^{*} is the probability of a correct response to item i for an examinee who has mastered all the attributes the item requires, and r_{ia}^{*} is the penalty applied when required attribute a is not mastered, so d_{ia} is the gap in correct-response probability between masters and non-masters of attribute a. A high π* and a low r* indicate a good item. We also used the traditional difficulty index, the p-value from classical test theory, and compared the results between the two approaches.

Next, we classified the students into different categories. Although there were eight possible attribute categories, Bloom's taxonomy allows only four of them because of its hierarchical nature. However, we summarized both the allowed and the disallowed categories in our study to explore research question 3.

Finally, we compared the students' profiles obtained from the two blocks. The percentage of profile change was probed and analyzed. If the blocks were parallel and the items were good, we expected a small percentage change for each attribute, and vice versa.

RESULTS

Item analyses

Item discrimination analyses based on the NCRUM and on classical test theory are shown in Tables 2 and 3.

Table 2. Attribute-Level Item Discrimination Based on Block 1 (NCRUM)

Item    Knowing   Applying   Reasoning
1         .15
2         .28       .01
3         .24       .09        .25
4         .28
5         .74       .48
6         .09       .21
7         .67       .06
8         .48       .49
9         .64
10        .33       .08
11        .62       .31
12        .52       .36
13        .71       .75
Mean      .44       .28        .25

Table 3. Attribute-Level Item Discrimination Based on Block 2 (NCRUM)

Item    Knowing   Applying   Reasoning
1         .25
2         .27
3         .56       .53
4         .11       .19
5         .37       .20
6         .34
7         .29
8         .54       .21        .74
9         .85       .73        .90
10        .60       .44        .05
11        .27       .38        .06
12        .26       .15
13        .40
14        .27       .02
15        .45
16        .84       .50
Mean      .42       .34        .44

There is little literature to guide the evaluation of item quality under the NCRUM. We decided to use .20 as a reasonable cut point: if the proportion of correct responses among masters of an attribute is at least 20 percentage points higher than among non-masters, the item discriminates well between masters and non-masters. Using this rule, the two blocks were found to discriminate masters from non-masters moderately well. Specifically, of the twenty-four attribute measures that could be evaluated for Block 1, eighteen were discriminating; of the thirty possible attribute measures for Block 2, twenty-four were discriminating.

Classical test theory (CTT) was also used to examine the quality of the test for Blocks 1 and 2 (Table 4). Overall, the reliability of the 29 test items was 0.848. According to the item statistics, item 22 (p-value = 0.14) and item 12 (p-value = 0.17) were the hardest items on the test, and item 27 (p-value = 0.84) was the easiest item for the examinees taking this test. The least discriminating of the 29 items were item 17 (r_pb = 0.162) and item 6 (r_pb = 0.164).

Table 4. Classical Test Theory: Reliability Statistics (N = 545)

Block      Cronbach's Alpha    # of Items
Block 1          .739              13
Block 2          .762              16
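
To connect the RUM output to Tables 2 and 3, the short sketch below applies Equation 1 to an item's π* and r* values. The helper function and the numerical values are illustrative assumptions, not the actual TIMSS estimates and not part of the RUM program.

```python
import numpy as np

def attribute_discrimination(pi_star, r_star, q_row):
    """Equation 1: for each attribute a the item measures (q_ia = 1),
    d_ia = pi_star * (1 - r_star_a), the gap in correct-response probability
    between masters and non-masters of attribute a. Attributes the item
    does not measure are returned as NaN."""
    q_row = np.asarray(q_row, dtype=float)
    r_star = np.asarray(r_star, dtype=float)
    d = pi_star * (1.0 - r_star)
    return np.where(q_row == 1, d, np.nan)

# Illustrative example: a Reasoning-type item measuring all three attributes.
d = attribute_discrimination(pi_star=0.80, r_star=[0.30, 0.55, 0.90], q_row=[1, 1, 1])
print(np.round(d, 2))  # approximately [0.56, 0.36, 0.08]; values of .20 or more
                       # would count as "discriminating" under the cut point above.
```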

Block 1 consisted of 13 items and had a reliability of 0.739. The easiest item was item 1 (p-value = 0.81) and the hardest was item 12 (p-value = 0.17). Item 6 (r_pb = 0.159) and item 1 (r_pb = 0.195) had the lowest discrimination values. Block 2 had a total of 16 items with a reliability of 0.762. Item 27 had a p-value of 0.84, making it the easiest item in Block 2, while item 22 was the hardest item with a p-value of 0.14. Item 17 (r_pb = 0.158) did not discriminate well among examinees. These results are also listed in Table 5.

Table 5. Classical Test Theory: Item Statistics (N = 545)

Block 1                                      Block 2
Item      M    Corrected Item-Total          Item       M    Corrected Item-Total
               Correlation                                   Correlation
Item 1   .81       .19                       Item 14   .79       .26
Item 2   .76       .26                       Item 15   .79       .31
Item 3   .63       .20                       Item 16   .47       .47
Item 4   .46       .28                       Item 17   .51       .15
Item 5   .28       .53                       Item 18   .52       .33
Item 6   .73       .15                       Item 19   .63       .36
Item 7   .52       .51                       Item 20   .51       .33
Item 8   .23       .32                       Item 21   .32       .40
Item 9   .49       .50                       Item 22   .14       .48
Item 10  .24       .29                       Item 23   .37       .43
Item 11  .25       .50                       Item 24   .52       .30
Item 12  .17       .44                       Item 25   .34       .25
Item 13  .26       .49                       Item 26   .78       .35
                                             Item 27   .84       .28
                                             Item 28   .79       .40
                                             Item 29   .34       .52

Block 1 and Block 2 were also separated according to Bloom's taxonomy to examine the item analyses in each subcategory (Table 6): Knowing, Applying, and Reasoning. In Block 1, Knowing had a total of 3 items with a reliability of 0.28, and the Applying category included 9 items with a reliability of 0.71; item analyses could not be conducted for the Reasoning category because it contained only one item.

In Block 2, Knowing included 6 items (α = 0.51) and Applying included 6 items (α = 0.49). The Reasoning section included 4 items with a reliability of 0.52.

Table 6. Classical Test Theory: Reliability Statistics by Cognitive Domain (N = 545)

Block                 Cronbach's Alpha    # of Items
Block 1  Knowing            .28                3
         Applying           .71                9
         Reasoning          ----               1
Block 2  Knowing            .51                6
         Applying           .49                6
         Reasoning          .52                4

If a higher-level cognitive skill is required for an item, the item usually discriminates the lower-level skill better than the higher-level skills. This is in line with Bloom's taxonomy, because more guessing may be involved for an item requiring a higher skill, making responses to it more variable and its discrimination less clear. In addition, having more items and/or more highly discriminating items would improve the quality of both Block 1 and Block 2.

When the items were analyzed for the cognitive attributes specifically, we found that the item that discriminated Knowing best was item 9 in Block 2. This item measured Knowing, Applying, and Reasoning, and only 14% of students answered it correctly. In contrast, item 6 in Block 1 discriminated Knowing least; it measured Knowing and Applying, and 73% of students answered it correctly. The item that discriminated Applying best was item 13 in Block 1, which measured Knowing and Applying and was answered correctly by 26% of students. The item that discriminated Applying least was item 2 in Block 1, which measured Knowing and Applying and was answered correctly by 76% of students. Interestingly, item 2 is a word problem rather than a multiple-choice question, which suggests that the attributes the question was intended to measure may not have been clearly defined (for the questions in both blocks, see Appendix A). The item that discriminated Reasoning best was item 8 in Block 2; it measured all three attributes, and 32% of students answered it correctly. The item that was least discriminating for the Reasoning attribute was item 10 in Block 2, which had a correct-response rate of 37%. For the Knowing and Applying attributes, easy items had low discrimination. This may suggest a ceiling effect: most students knew the correct answer, so the items lost the ability to discriminate. For the Reasoning attribute, no such pattern was found.
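
For readers who want to reproduce the CTT side of Tables 4 through 6, here is a minimal sketch of how item p-values, corrected item-total correlations, and Cronbach's alpha can be computed from a scored 0/1 response matrix. The function names and the toy data are ours, not part of any package used in the study.

```python
import numpy as np

def p_values(X):
    """Item difficulty as the proportion of correct (1) responses."""
    return X.mean(axis=0)

def corrected_item_total(X):
    """Correlation of each item with the total score of the remaining items."""
    total = X.sum(axis=1)
    return np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                     for j in range(X.shape[1])])

def cronbach_alpha(X):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

# Toy 0/1 response matrix (6 examinees x 4 items), purely illustrative.
X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 0],
              [1, 0, 1, 1]])
print(p_values(X), corrected_item_total(X), cronbach_alpha(X))
```

Applied per block, or per cognitive-domain subscale, these three functions yield the kinds of values reported in Tables 4, 5, and 6.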

Attribute Profile

A probability of mastery higher than 0.60 suggests that an examinee has mastered an attribute, while a probability lower than 0.40 suggests that the examinee has not mastered it (Rupp et al., 2010). We deleted the cases with any probability of mastering an attribute between 0.40 and 0.60, meaning that more information would be needed to classify these students accurately. Of the 545 individuals, 402 provided valid data. The possible latent class profiles for three attributes are shown in Table 7.

Table 7. Possible Student Profile Classifications

Latent Class    Attribute Profile
1               000
2               100
3               110
4               111
5               010
6               001
7               101
8               011

Only the first four classes are reasonable according to the hierarchy described in Bloom's taxonomy. In our results, each block generated five different attribute profiles. The attribute profile for each latent class and the corresponding probability are shown in Table 8.

Table 8. Probabilities of the Attribute Profiles

                Block 1                       Block 2
Latent class    Attribute Profile      p      Attribute Profile      p
1               α11 = (0,0,0)        .493     α21 = (0,0,0)        .420
2               α12 = (0,1,0)        .032     α22 = (1,0,0)        .318
3               α13 = (1,0,0)        .313     α23 = (1,0,1)        .002
4               α14 = (1,1,0)        .159     α24 = (1,1,0)        .142
5               α15 = (1,1,1)        .002     α25 = (1,1,1)        .170

According to the table, Bloom's taxonomy is generally supported. Three attribute profiles had low probabilities: (0,1,0) in Block 1, (1,0,1) in Block 2, and (1,1,1) in Block 1. The first two are unreasonable according to Bloom's taxonomy, and their low probabilities in the data support this. Excluding these profiles, we found only four latent classes: (0,0,0), (1,0,0), (1,1,0), and (1,1,1). These are exactly what Bloom's taxonomy would predict. Among the four latent classes, the percentages of students mastering each attribute were also reasonable: the probability decreases as the number of mastered attributes increases, which suggests a hierarchy among the three attributes. However, the probability of latent class 5 in Block 1 was extremely low. This may be because there is only one item in Block 1 that measures Reasoning, so the result may not be accurate.
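
The classification rule just described, together with the block comparison reported in the next section, can be sketched as follows. The posterior mastery probabilities in the example are invented, and the 0.60/0.40 cut points simply follow the rule above.

```python
import numpy as np

def classify(post, hi=0.60, lo=0.40):
    """Turn posterior mastery probabilities into 1 (master), 0 (non-master),
    or -1 (indeterminate: between lo and hi)."""
    out = np.full(post.shape, -1, dtype=int)
    out[post > hi] = 1
    out[post < lo] = 0
    return out

def switch_rate(profiles_1, profiles_2):
    """Per-attribute proportion of examinees whose master/non-master status
    differs between the two blocks, dropping anyone with an indeterminate
    value in either block (as done in the study)."""
    keep = (profiles_1 >= 0).all(axis=1) & (profiles_2 >= 0).all(axis=1)
    p1, p2 = profiles_1[keep], profiles_2[keep]
    return (p1 != p2).mean(axis=0)

# Invented posterior probabilities: 4 examinees x 3 attributes per block.
block1 = classify(np.array([[.90, .70, .20], [.80, .30, .10], [.20, .10, .10], [.70, .65, .62]]))
block2 = classify(np.array([[.90, .20, .30], [.85, .35, .15], [.30, .20, .10], [.75, .70, .30]]))
print(switch_rate(block1, block2))  # one switch rate per attribute (Knowing, Applying, Reasoning)
```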

Block Comparison

Our hypothesis was that Block 1 and Block 2 should generate the same mastery profile for each person, because the blocks are designed to be parallel; if the mastery profiles generated by the two blocks differ, they are not parallel tests in practice. By comparing each examinee's attribute mastery profiles generated by Block 1 and Block 2, discrepancies were found across the three attributes. The percentage of students who switched between being classified as a master and a non-master of an attribute was 29.7% for Knowing, 17.4% for Applying, and 12.2% for Reasoning. This means that the blocks could not categorize the students consistently. Although the discrepancy seems to decrease for the higher cognitive domains, it should be noted that the probability of answering the Applying and Reasoning items correctly also decreases. Evidently, these two blocks were not parallel tests for classification and diagnostic purposes. A second reason for the discrepancy between the diagnostic results may be the unbalanced item distribution in Block 1, which contains only one item measuring Reasoning; therefore, the judgment of students' Reasoning ability obtained from Block 1 was not reliable.

Relating CTT with CDM indices

Generally speaking, both blocks discriminated among students well. The range of the discrimination index for Block 1 was (0.01, 0.75) based on the CDM and (0.15, 0.53) based on CTT; for Block 2, it was (0.02, 0.85) based on the CDM and (0.15, 0.52) based on CTT. All the item discrimination indices were positive, which shows that the items were reasonable. While the CTT index is an item-level discrimination, the CDM index is an attribute-level discrimination. When an item measured only one attribute, the CDM and CTT discrimination indices were consistent. When all attributes for an item had small discrimination indices, that item had a small CTT discrimination index as well (e.g., item 4 in Block 2). However, when the attribute discrimination indices varied across the attributes of an item, the CTT item discrimination index appeared to be a balance of all the attribute discrimination indices.

SUMMARY

This paper evaluated the efficacy of Booklet 1 of TIMSS 2007 in measuring the cognitive ability of eighth-grade students. We used methods based on both cognitive diagnostic modeling (CDM) and classical test theory (CTT) to examine how well the items discriminated among students with different ability levels. We demonstrated how to use various indices to evaluate parallelism from a CDM perspective and compared them with CTT indices. More empirical research should be done to explore the relationship between item difficulty in CTT and attribute discrimination in CDM.

We also compared the two blocks in Booklet 1 with regard to their ability to classify students accurately. The reliability evaluation based on CTT showed that both blocks had good internal consistency. However, after using the CDM to categorize students into latent classes, we did not obtain evidence that the two blocks give consistent and reliable classifications of students: the percentage of students who switched between master and non-master of an attribute was 29.7% for Knowing, 17.4% for Applying, and 12.2% for Reasoning. This raises concerns about the validity of test score interpretation if the diagnostic feature is built in at the design stage and is expected to be shared with score users.
Fortunately, TIMSS was not designed for diagnostic purposes. If a test is designed for diagnostic purposes, concepts such as parallel forms and reliability will have to be defined differently than for non-diagnostic tests. This paper explores this issue and casts new light on these traditional concepts from a CDM perspective. This is relevant not only to the validity of test score interpretation but also to the initial stage of test development, where the test blueprint will have to consider new dimensions to ensure test quality. Of course, many other concepts related to the current trend toward computer-based testing, such as test-assembly engineering, will also change. This is a worthwhile field for further exploration in diagnostic assessment.

The CDM was also used to examine the hierarchy among the three cognitive domains. Our results support the hypothesis that student responses reflect a hierarchy of the three cognitive skills for the TIMSS mathematics measurement: if a higher-level cognitive skill was required for an item, all the lower-level skills had to be present to give a correct response. These three attributes are in the same order as defined in Bloom's taxonomy, with Knowing being the lowest skill and Reasoning (corresponding to Evaluation) being the highest.

References

Bloom, B. S. (1956). Taxonomy of educational objectives, handbook 1: The cognitive domain. New York: David McKay.

Chen, F. (2011). Diagnostic classification models: A new tool for the testing field [in Chinese]. 外语教学理论与实践, (2), 29-34.

Gibson, W. M., & Weiner, J. A. (1998). Generating parallel test forms using CTT in a computer-based environment. Journal of Educational Measurement, 35, 297-310.

Hambleton, R. K., & Swaminathan, H. (1990). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic Publishers.

Krathwohl, D. R. (2002). A revision of Bloom's taxonomy: An overview. Theory Into Practice, 41, 212-218.

Leighton, J. P., & Gierl, M. J. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. New York, NY: Cambridge University Press.

Olson, J. F., Martin, M. O., & Mullis, I. V. (2009). TIMSS 2007 technical report. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Olson, J. F., & Foy, P. (2009). TIMSS 2007 user guide for the international database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: Guilford Press.