Substantive and Cognitive Interpretations of Gender DIF on a Fractions Concept Test. Robert H. Fay and Yi-Hsin Chen. University of South Florida

Similar documents
Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

Revisiting Differential Item Functioning: Implications for Fairness Investigation

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

A Modified CATSIB Procedure for Detecting Differential Item Functioning on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

Academic Discipline DIF in an English Language Proficiency Test

Differential Item Functioning

André Cyr and Alexander Davies

Keywords: Dichotomous test, ordinal test, differential item functioning (DIF), magnitude of DIF, and test-takers. Introduction

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Parallel Forms for Diagnostic Purpose

Gender-Based Differential Item Performance in English Usage Items

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Testing the Multiple Intelligences Theory in Oman

Published by European Centre for Research Training and Development UK (

By Hui Bian Office for Faculty Excellence

AN ASSESSMENT OF ITEM BIAS USING DIFFERENTIAL ITEM FUNCTIONING TECHNIQUE IN NECO BIOLOGY CONDUCTED EXAMINATIONS IN TARABA STATE NIGERIA

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

International Journal of Education and Research Vol. 5 No. 5 May 2017

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Still important ideas

An Introduction to Missing Data in the Context of Differential Item Functioning

Differential Performance of Test Items by Geographical Regions. Konstantin E. Augemberg Fordham University. Deanna L. Morgan The College Board

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Basic concepts and principles of classical test theory

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

Multidimensionality and Item Bias

Writing Reaction Papers Using the QuALMRI Framework

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Technical Specifications

Chapter 7: Descriptive Statistics

Information Structure for Geometric Analogies: A Test Theory Approach

Exploratory Factor Analysis Student Anxiety Questionnaire on Statistics

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Communication Research Practice Questions

Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going

Introduction to Test Theory & Historical Perspectives

V. Measuring, Diagnosing, and Perhaps Understanding Objects

The Influence of Conditioning Scores In Performing DIF Analyses

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

A Differential Item Functioning (DIF) Analysis of the Self-Report Psychopathy Scale. Craig Nathanson and Delroy L. Paulhus

Rewards for reading: their effects on reading motivation

The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign

Variability. After reading this chapter, you should be able to do the following:

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

A framework for predicting item difficulty in reading tests

A Comparison of Several Goodness-of-Fit Statistics

GRE R E S E A R C H. Development of a SIBTEST Bundle Methodology for Improving Test Equity With Applications for GRE Test Development

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

Encoding of Elements and Relations of Object Arrangements by Young Children

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Getting a DIF Breakdown with Lertap

Session 2, Document A: The Big Ideas and Properties in Multiplication

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Span Theory: An overview

The Role of Modeling and Feedback in. Task Performance and the Development of Self-Efficacy. Skidmore College

Item Difficulty Modeling on Logical and Verbal Reasoning Tests

Techniques for Explaining Item Response Theory to Stakeholder

MS&E 226: Small Data

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

A Multidimensionality-Based DIF Analysis Paradigm

Tasks of Executive Control TEC. Interpretive Report. Developed by Peter K. Isquith, PhD, Robert M. Roth, PhD, Gerard A. Gioia, PhD, and PAR Staff

- Triangulation - Member checks - Peer review - Researcher identity statement

Detection Theory: Sensitivity and Response Bias

Examining the Psychometric Properties of The McQuaig Occupational Test

2013 Supervisor Survey Reliability Analysis

Rasch Versus Birnbaum: New Arguments in an Old Debate

Doctoral Dissertation Boot Camp Quantitative Methods Kamiar Kouzekanani, PhD January 27, The Scientific Method of Problem Solving

Item Analysis Explanation


REPORT. Technical Report: Item Characteristics. Jessica Masters

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Introduction to statistics Dr Alvin Vista, ACER Bangkok, 14-18, Sept. 2015

Carrying out an Empirical Project

Making a psychometric. Dr Benjamin Cowan- Lecture 9

Still important ideas

A critical look at the use of SEM in international business research

Models in Educational Measurement

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Optimization and Experimentation. The rest of the story

linking in educational measurement: Taking differential motivation into account 1

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Transcription:

Running head: GENDER DIF

Substantive and Cognitive Interpretations of Gender DIF on a Fractions Concept Test

Robert H. Fay and Yi-Hsin Chen
University of South Florida

Yuh-Chyn Leu
National Taipei University of Education

Presented at the annual meeting of the National Council on Measurement in Education, San Diego, California, April 13-17, 2009

Please address correspondence concerning this manuscript to:
Robert Fay
Department of Educational Measurement and Research
University of South Florida
4202 E. Fowler Ave., EDU 162
Tampa, FL 33620-7750
Email: rfay@mail.usf.edu

Substantive and Cognitive Interpretations of Gender DIF on a Fractions Concept Test

Background

A substantial body of research has examined gender differential item functioning (DIF), most of it concerned with differences in performance on quantitative items. DIF studies date back to Eells, Davis, Havighurst, Herrick, and Tyler (1951), who tried to distinguish test items that favored children from high-SES backgrounds from items that favored children from low-SES households. Since then, dozens of studies have been published on the subject. One of the first gender DIF studies was conducted by Hambleton and Traub (1974), who explored interactions between gender and item order on mathematics tests, but the results were inconclusive.

Subsequent gender DIF studies reached firmer, though not always consistent, conclusions. In general, boys and girls were found to perform differently on different types of questions. Plake, Ansorge, Parker, and Lowry (1982) found that boys did better than girls on timed tests in which the difficult items were placed toward the end. Word problems were found to be differentially easier for boys, as were problems requiring higher-level thinking skills, including geometry and arithmetic, while girls and boys did equally well on items involving spatial components (Ryan & Chiu, 2001). Ryan and Fan (1996), by contrast, found that algebra and computation items were differentially easier for females and geometry items differentially easier for males; their results also indicated that arithmetic items were differentially easier for females, while applied items were again differentially more difficult for females. Berberoglu (1995), who studied mathematics items across both gender and SES groups, determined that males had the advantage on computation items but that females performed better on word problems and geometry items, suggesting that females had better verbal and spatial ability while males had better overall computational skill.

Inabi and Dodeen (2006) found that 37 of 124 items on an 8th-grade mathematics test displayed gender DIF. In that study, all of the measurement items with DIF favored boys, while most of the DIF items in the algebra and data-analysis areas favored girls. Most of the items favoring boys were unfamiliar to girls and had open-ended answers involving risk, requiring estimations, expectations, or approximations rather than a single finite answer; in contrast, most of the DIF items that favored girls were familiar items with one specific correct answer. Bielinski and Davison (2001) reported a sex-by-item-difficulty interaction in which easy items tended to be relatively easier for girls and hard items relatively easier for boys. Lane, Wang, and Magone (1996) applied a procedure based on logistic discriminant function analysis to a mathematics test with open-ended tasks taken by middle school students and concluded that, with respect to uniform gender DIF, four tasks favored female students and two favored male students. The two tasks that favored boys had figures not drawn to scale, while the four tasks that favored girls did not; one task involving geometry skills displayed severe DIF favoring males. Hamilton (1999) showed that gender differences were largest on items involving visualization and drew on knowledge acquired outside of school to explain the gender DIF. Walstad and Robson (1997) analyzed items from the Test of Economic Literacy (TEL), using item response theory (IRT) to identify questions with large gender DIF. Although there was a statistically significant difference between girls' and boys' scores before the DIF items were removed, there were indications of sources other than DIF for the gender differences, such as differential reasoning, differences in socialization, or different instructional methods and testing formats.

Roussos and Stout (1996) developed a multidimensionality-based paradigm that combines substantive DIF considerations (such as cognitive differences and other content descriptions, without statistical confirmation) with statistical DIF analysis. Although this idea had been discussed in earlier work, including Cronbach (1990), Jensen (1980), Messick (1989), and Wiley (1990), Roussos and Stout extended it by being the first to attempt to increase statistical power by testing bundles of items measuring similar dimensions.

Multidimensionality-Based DIF Analysis Paradigm

The simultaneous item bias test (SIBTEST; Shealy & Stout, 1993) implements this multidimensional paradigm for detecting differential item and bundle functioning (DIF and DBF). Ryan and Chiu (2001) used SIBTEST to test word-problem items for DBF against the total score on non-word-problem items, in order to determine whether girls and boys were differentially affected by a change in item position within the test. Walker and Beretvas (2001) tested the hypothesis that open-ended mathematics items were multidimensional, measuring mathematics communication skill in addition to general mathematics ability, by comparing proficient and non-proficient fourth- and seventh-grade writers. Finally, SIBTEST was applied to items from a curriculum-based mathematics achievement test to quantify DIF effect sizes and to test general DIF hypotheses, such as whether the data actually possess two distinct dimensions as defined by Roussos and Stout (Gierl, Bisanz, Bisanz, & Boughton, 2003); items hypothesized to measure two dimensions were matched against items hypothesized to measure only one. In the present study, we intended to develop similar explanations for gender DIF items and to offer them as possible remediation tools for future test takers.

Purpose of the Research

This study investigated whether gender differential item functioning (DIF) was present in any of the twenty-three items, and whether gender differential bundle functioning (DBF) was present in parcels of items, on a fractions concept test for Taiwanese elementary school students; that is, both DIF and DBF were measured. The objective was to determine the reasons behind any gender DIF found and to suggest ways of eliminating its sources. Did the contextual properties of specific items lead to one gender being more familiar with the framework of those items, and thereby to differential problem-solving strategies and different performance levels for the two groups? Both exploratory single-item and confirmatory bundle DIF analyses were carried out.

Methods

Participants

The data were collected from 2612 fifth- and sixth-grade students in Taiwan: 1283 fifth graders (49.12%) and 1329 sixth graders (50.88%), comprising 1330 girls (50.92%) and 1282 boys (49.08%). To obtain a representative sample, Taiwan was first divided into six school regions. Schools were then grouped by size within each region: schools with fewer than 13 classrooms were considered small, those with 13 to 35 classrooms medium-sized, and those with more than 35 classrooms large. Stratified random sampling was then carried out within the regions to obtain a representative distribution of school sizes.

Instrument

A twenty-three-item multiple-choice test on fraction concepts, developed by Chan and Leu (2004), was administered to the sample of 2612 students. It was designed to measure three major fraction concepts: the equal sharing concept, the units concept, and the equivalent fraction concept.

A fourth concept, the basic fraction concept, was also considered crucial for students to understand in order to solve fraction test items correctly, so items of this type were incorporated into the test as well. Some test items require knowledge of more than one of these conceptual areas to be solved. Table 1 presents the four major fraction concepts along with the relevant test items for each. Cronbach's alpha for the entire test was 0.85.

Task Analysis

In addition to the four content categories, the fraction items were also classified by the knowledge, skills, and abilities (KSAs; Gorin, 2006; Chen, Gorin, Thompson, & Tatsuoka, 2008) involved in solving them. Seven cognitive component item categories were constructed by MacDonald, Chen, Li, and Leu (2009), based on the mathematics cognitive skills of Corter and Tatsuoka (2002) and a variation of the cognitive component factor structure developed by Chan, Leu, and Chen (2007). These seven categories were validated with the linear logistic test model (LLTM) for the sample used in this study (MacDonald, Chen, Li, & Leu, 2009). The seven bundles tested clustered items according to the following cognitive components: (1) using illustrations; (2) providing a written interpretation; (3) judgmental application; (4) computation; (5) checking options; (6) spatial unfolding; and (7) solving routine problems.

Statistical Analysis

A series of statistical DIF and DBF analyses was conducted. First, individual gender DIF analyses were performed using the logistic regression (LR) approach. A variation of the logistic regression code developed by Zumbo (1999) was used to determine whether there was uniform or non-uniform DIF for any of the twenty-three test items with respect to the binary gender variable.

Zumbo's LR DIF approach uses a three-step process that yields the Nagelkerke R-squared measure of DIF effect size for both uniform and non-uniform DIF. The total scale score for each student and the binary gender variable were entered as independent variables in the logistic regression equation for detecting uniform DIF, and an interaction term between the two was added for the non-uniform DIF determination. Chi-square changes (with two degrees of freedom for uniform DIF and one degree of freedom for non-uniform DIF) were computed along with the R-squared effect-size measures. If either p-value for an item was 0.05 or less, the item was flagged for DIF. Finally, researcher judgmental analysis was applied to each flagged item to determine why it showed DIF: if boys and girls were learning differently or being exposed differently to certain key learning stimuli, it would be crucial to be able to explain the reasons for the gender DIF between them.

Next, SIBTEST, a nonparametric statistical method, was used to assess differential item functioning (DIF) for individual items (to be compared with the logistic regression results above) and differential bundle functioning (DBF) for bundles of items. The SIB statistic and the bias estimator $\hat{\beta}$ were calculated (for the underlying theory, see Shealy & Stout, 1993). In SIBTEST, the amount of DIF in the studied subtest is captured by the parameter $\beta_{UNI}$, defined by Shealy and Stout as

$$\beta_{UNI} = \int \beta(\theta)\, f_F(\theta)\, d\theta, \qquad \beta(\theta) = P(\theta, R) - P(\theta, F),$$

where $\beta(\theta)$ is the difference in the probabilities of a correct response for reference- and focal-group test takers conditional on ability $\theta$, and $f_F(\theta)$ is the density function of $\theta$ in the focal group. Integrating over $\theta$ yields a weighted expected score difference, on an item or bundle of items, between reference and focal group examinees of the same ability.
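As a concrete illustration of the three-step LR procedure just described, the following Python sketch (not the authors' original Zumbo code; the use of statsmodels and the variable names are assumptions made here) fits the nested logistic models for a single item and reports the chi-square changes and Nagelkerke R-squared increments. It tests each added term with one degree of freedom, which is one common variant of the procedure.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def nagelkerke_r2(fitted, null_ll, n):
    # Cox-Snell R-squared rescaled to the 0-1 range (Nagelkerke).
    cox_snell = 1.0 - np.exp((2.0 / n) * (null_ll - fitted.llf))
    return cox_snell / (1.0 - np.exp((2.0 / n) * null_ll))

def lr_dif(item, total, gender):
    # item: 0/1 responses to the studied item; total: total scale scores;
    # gender: 0/1 group codes. Returns p-values and R-squared changes.
    item, total, gender = (np.asarray(a, dtype=float) for a in (item, total, gender))
    n = len(item)
    null_ll = sm.Logit(item, np.ones((n, 1))).fit(disp=0).llf    # intercept-only model
    X1 = sm.add_constant(pd.DataFrame({"total": total}))         # Step 1: total score only
    X2 = X1.assign(gender=gender)                                # Step 2: + gender (uniform DIF)
    X3 = X2.assign(interaction=total * gender)                   # Step 3: + interaction (non-uniform DIF)
    m1, m2, m3 = (sm.Logit(item, X).fit(disp=0) for X in (X1, X2, X3))
    r2 = [nagelkerke_r2(m, null_ll, n) for m in (m1, m2, m3)]
    return {
        "uniform_p": chi2.sf(2 * (m2.llf - m1.llf), df=1),
        "nonuniform_p": chi2.sf(2 * (m3.llf - m2.llf), df=1),
        "delta_R2_uniform": r2[1] - r2[0],
        "delta_R2_nonuniform": r2[2] - r2[1],
    }

An item would be flagged when either p-value is at or below .05, with the corresponding R-squared change reported as the effect size.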

SIBTEST evaluates this parameter estimate with the test statistic

$$SIB = \hat{\beta}_{UNI} / \hat{\sigma}(\hat{\beta}_{UNI}),$$

where $\hat{\sigma}(\hat{\beta}_{UNI})$ is the estimated standard error of $\hat{\beta}_{UNI}$. Shealy and Stout (1993) showed that, under the null hypothesis of no DIF, SIB is normally distributed with mean 0 and variance 1.

To examine DBF, items on the fraction test were first separated into the studied subtest and the matching subtest on the basis of the four content groupings (see Gierl et al., 2003, for detailed procedures). Before the DBF analysis (confirmatory DIF analysis), an individual-item DIF analysis (exploratory DIF analysis) was conducted with SIBTEST, to be compared with the results from the LR approach, in an effort to improve the task analysis of all the test items. A Crossing-SIBTEST analysis was also run for each of the twenty-three items so that the non-uniform DIF results could be compared with those from the LR approach. Finally, the cognitive-skills bundles were examined for DIF using SIBTEST. For each theorized cognitive component bundle, items believed to be multidimensional (based on their theorized constructs), and therefore expected to show DIF, were assigned to the studied subtest, while items thought to measure only one dimension formed the matching subtest. The matching subtest places boys and girls into subgroups at each score level so that their performances on the studied subtest items can be compared (Gierl et al., 2003).
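For intuition about these quantities, the sketch below computes an observed-score analogue of $\hat{\beta}_{UNI}$ and the SIB statistic: examinees are matched on the matching-subtest score, and the reference-focal difference in proportion correct at each score level is weighted by the focal-group frequency. This is a simplified illustration only; it omits the regression correction that Shealy and Stout apply to the matching scores, and the function and variable names are hypothetical.

import numpy as np
from scipy.stats import norm

def simple_sibtest(studied, matching, is_focal):
    # studied: 0/1 scores on the studied item (or the sum over a studied bundle);
    # matching: scores on the matching subtest; is_focal: True for focal-group examinees.
    studied, matching, is_focal = map(np.asarray, (studied, matching, is_focal))
    is_focal = is_focal.astype(bool)
    beta_hat, var_hat, n_focal = 0.0, 0.0, is_focal.sum()
    for k in np.unique(matching):                        # condition on matching score k
        ref = studied[(matching == k) & ~is_focal]
        foc = studied[(matching == k) & is_focal]
        if len(ref) < 2 or len(foc) < 2:
            continue                                     # skip unusable score cells
        w = len(foc) / n_focal                           # focal-group weight at score k
        beta_hat += w * (ref.mean() - foc.mean())        # weighted P(theta, R) - P(theta, F)
        var_hat += w ** 2 * (ref.var(ddof=1) / len(ref) + foc.var(ddof=1) / len(foc))
    sib = beta_hat / np.sqrt(var_hat)                    # SIB = estimate / standard error
    return beta_hat, sib, 2 * norm.sf(abs(sib))          # estimate, z statistic, two-sided p

In this convention a positive estimate indicates an item or bundle favoring the reference group and a negative estimate one favoring the focal group, consistent with the signs reported in Tables 3 and 4.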

Results

Uniform and Non-uniform Item DIF

Eleven of the twenty-three items on the fractions concept test displayed uniform gender DIF under the LR approach, and nine showed non-uniform gender DIF. The R-squared effect-size values for the DIF items showed larger incremental increases than those for items without DIF, clearly indicating which items behaved differentially. Table 2 shows the results of this analysis.

The SIBTEST DIF results were expected to be similar to those from the LR approach, and this expectation was borne out, supporting the consistency of the two methods in the DIF analysis of the fraction items (see Table 3). Only one of the most borderline items produced a discrepancy between the two methods: ten of the eleven items flagged by the logistic regression approach also showed DIF with SIBTEST. The exception, Item 21, showed uniform DIF under LR (its chi-square change of 4.36 was the smallest among the eleven flagged items) but fell just short of significance with SIBTEST (p = .071). Six of the ten DIF items favored males (Items 1, 2, 7, 14, 18, and 20), while the other four (Items 9, 11, 16, and 19, along with the borderline Item 21) displayed DIF favoring girls. Some possible explanations for these gender differences follow.

First, the items favoring males: Items 1 and 2 use spatial representation, a skill that favors boys (Gallagher et al., 2000). Both items also use sweets for context, a subject slightly more motivating for boys of this age than for girls (Allesen-Holm, Bredie, & Frost, 2008). Item 7 also refers to sweets and involves another skill that Gallagher et al. identify as favoring boys: transforming information from one spatial format to another.

Items 14 and 18 require multiple solution paths, which Gallagher et al. describe as a problem-solving skill that favors boys; Item 18 also involves sweets. Item 20 requires converting a word problem into a spatial representation, yet another skill that Gallagher and colleagues say favors boys.

The items favoring girls can be explained similarly. Items 9 and 11 posed situations in which solving a fraction required picking the solution from a series of pictures. Hamilton (1999) reported that problems involving visualization favored girls and, further, that gender DIF can often only be explained by looking outside the classroom environment. Picking a solution from pictures may be a skill more common in girls of this age than boys, but there is no confirmation of that in the literature; in this case, the content of the DIF items did not display distinct themes that would help identify the source of the DIF. Item 16 is easier for girls because they read word problems more carefully (Berberoglu, 1995) and thus can more readily convert the words into algebraic solutions (Gallagher et al., 2000); the scenario mentions both marbles and buttons, but only the number of marbles is needed to solve the problem, and careful readers would notice that. Item 19 refers to fruit, a type of food preferred by girls, who do not have the more adventurous tastes of boys (Allesen-Holm et al., 2008); in addition, the wording of the question could be misinterpreted more easily than most.

The non-uniform DIF results were also similar across the two methods. This time Item 21 did not differ between them: no DIF was indicated (though only just) by Crossing-SIBTEST, which gave a p-value of 0.76, or by the LR method, whose chi-square change of 4.58 was not quite large enough to indicate DIF. The other borderline item, Item 16, did differ between the two analytical methods. The LR approach showed its chi-square change to be just above that for Item 21 but still below the DIF range (the smallest chi-square change for any of the non-uniform DIF items was above 6; see Table 2), whereas the Crossing-SIBTEST p-value for this item was right at .05, confirming its borderline status. The other nine non-uniform DIF items were identified by both methods.

Content and Cognitive Differential Bundle Functioning

The DBF analyses for the four content bundles and seven cognitive bundles were performed exclusively with SIBTEST, using its option for grouping studied items. Two of the four content bundles showed DBF (see Table 4). The Equal Sharing conceptual grouping, consisting of items that ask the student to work out what proportion of some product remains after a fraction of it is shared with others, uses illustrations to show the separated fractions; this group consists of Items 1, 2, 3, and 10 (see Table 1). The grouping showed DBF favoring boys, with a positive beta estimate of 0.126 (p < .01). A probable explanation, according to the modified Gallagher et al. (2000) taxonomy, is that at least one of these items, Item 10, requires multiple solution paths, an ability more often found in boys of this age than in girls. The Equivalent Fraction conceptual grouping is much larger, containing Items 3, 5, 6, 8, 9, 11, 12, 17, 18, 19, 20, and 21, and it also shows DBF; girls are favored on the items in this bundle, with a negative beta estimate of -0.36 (p < .01). The other two content bundles, Basic Fraction Concepts and Units Concepts, showed no DBF, with beta estimates of 0.051 and 0.003, respectively.

With two of the four content bundles showing DBF among their item groups, the unidimensionality of those two major concept areas is questionable. Although the four content groups are recognized major concept areas of fractions, some of the items on this fractions test may measure secondary or tertiary concepts or abilities, with items favoring males dominating the Equal Sharing Concept group and items favoring females dominating the Equivalent Fraction Concept group. Since Items 1 and 2 show DIF favoring males, the Equal Sharing group was bound to show the same pattern. The Equivalent Fraction group contains four items that favor females and only two with male-favoring DIF, so it is no surprise that this bundle shows DBF favoring females.

Finding DBF in a test can highlight the presence of many individual DIF items. It can also show which items should not be conceptually bundled together, or which content areas are not valid constructs, if unidimensional component clustering is the goal. In this case, fraction items involving sharing pieces of food were found to favor boys, so items of this type should use some other, more gender-neutral shared object (food appears to appeal differentially more to boys than to girls of this age) if the Equal Sharing concept items are to form a unidimensional bundle. Likewise, the Equivalent Fraction group could be constructed better if it is to be an illustrative collection of items representing that concept.

A quite different set of bundles comprised the seven cognitive components required to solve the items correctly (Table 5). The SIBTEST analyses showed all seven bundles to be free of DBF, indicating that the items in each defined their respective groups precisely and without significant gender differentiation. No multidimensionality was observed, although it had been posited, so the cognitive bundles used here were not informative in that respect.

However, because these specific cognitive bundles did not show DBF, they should inform the design of cognitive component bundles in future studies, further elucidating the dimensions of cognition and supporting the creation of psychometric tests that better incorporate different cognitive skills and processes into their items.

Educational Significance

The findings from this study should help to clarify the reasons for gender DIF by identifying which conceptual areas within a fractions test show differences by gender. A further significant finding is that psychometric tests and cognitive psychology content areas can be bridged to form precise cognitive component bundles from test items. The implications are potentially substantial: testing can be made much more specific to the cognitive skills to be measured, and more cognitive skills and processes can be understood and included on psychometric tests. Other studies have proposed reasons for gender DIF and DBF on other types of mathematics tests, but the conceptual differences involved in solving fraction test items had not previously been elaborated upon. Test developers and teachers should be able to use this knowledge to build more focused tests and teaching techniques that address these issues in this subject area.

References

Berberoglu, G. (1995). Differential item functioning (DIF) analysis of computation, word problem and geometry questions across gender and SES groups. Studies in Educational Evaluation, 21(4), 439-456.
Bielinski, J., & Davison, M. L. (2001). A sex difference by item difficulty interaction in multiple-choice mathematics items administered to national probability samples. Journal of Educational Measurement, 38(1), 51-77.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Chan, W. H., & Leu, Y. C. (2004). The design of the rating scale of fraction for 5th and 6th graders. Chinese Journal of Science Education, 12, 241-263.
Chan, W., Leu, Y., & Chen, C. (2007). Exploring group-wise conceptual deficiencies of fractions for fifth and sixth graders in Taiwan. Journal of Experimental Education, 76, 26-57.
Chen, Y.-H., Gorin, J. S., Thompson, M. S., & Tatsuoka, K. K. (2008). Cross-cultural validity of the TIMSS-1999 mathematics test: Verification of a cognitive model. International Journal of Testing.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper and Row.
Eells, K. W., Davis, A., Havighurst, R. J., Herrick, V. E., & Tyler, R. W. (1951). Intelligence and cultural differences. Chicago: University of Chicago Press.
Gierl, M. J., Bisanz, J., Bisanz, G. L., & Boughton, K. A. (2003). Identifying content and cognitive skills that produce gender differences in mathematics: A demonstration of the multidimensionality-based DIF analysis. Journal of Educational Measurement, 40(4), 281-306.

Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25, 21-36.
Hambleton, R. K., & Traub, R. E. (1974). The effects of item order on test performance and stress. Journal of Experimental Education, 43, 40-46.
Hamilton, L. S. (1999). Detecting gender-based differential item functioning on a constructed-response science test. Applied Measurement in Education, 12(3), 211-235.
Inabi, H., & Dodeen, H. (2006). Content analysis of gender-related differential item functioning TIMSS items in mathematics in Jordan. School Science & Mathematics, 106(8), 328-337.
Jensen, A. R. (1980). Bias in mental testing. New York: Macmillan.
Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement, 15, 21-27.
Macdonald, G., Chen, Y.-H., & Leu, Y.-C. (2009). Exploring cognitive sources of item difficulty of mathematics fraction items. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, California.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Plake, B. S., Ansorge, C. J., Parker, C. S., & Lowry, S. R. (1982). Effects of item arrangement, knowledge of arrangement, test anxiety, and sex on test performance. Journal of Educational Measurement, 19, 49-58.
Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20(4), 355-371.

Ryan, K. E., & Chiu, S. (2001). An examination of item context effects, DIF, and gender DIF. Applied Measurement in Education, 14(1), 73-90.
Ryan, K. E., & Fan, M. (1996). Examining gender DIF on a multiple-choice test of mathematics: A confirmatory approach. Educational Measurement: Issues and Practice, 15(4), 21-27.
Shealy, R., & Stout, W. (1993). An item response theory model for test bias and differential test functioning. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197-240). Hillsdale, NJ: Lawrence Erlbaum Associates.
Tatsuoka, K. K., Linn, R. L., Tatsuoka, M. M., & Yamamoto, K. (1988). Differential item functioning resulting from the use of different solution strategies. Journal of Educational Measurement, 25(4), 301-319.
Walker, C. M., & Beretvas, S. N. (2001). An empirical investigation demonstrating the multidimensional DIF paradigm: A cognitive explanation for DIF. Journal of Educational Measurement, 38(2), 147-163.
Walstad, W. B., & Robson, D. (1997). Differential item functioning and male-female differences on multiple-choice tests in economics. The Journal of Economic Education, 28, 155-171.
Wiley, D. E. (1990). Test validity and invalidity reconsidered. In R. Snow & D. E. Wiley (Eds.), Improving inquiry in social science (pp. 75-107). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources.
University of Copenhagen. (2008, December 18). Girls have superior sense of taste to boys. ScienceDaily. Retrieved April 6, 2009, from http://www.sciencedaily.com/releases/2008/12/081216104035.htm

Table 1
Four Major Fraction Concepts and Their Associated Items in the Fraction Test

Major Concept of Fraction          Items
1. Basic Fraction Concept          4, 5, 7, 13, 15
2. Equal Sharing Concept           1, 2, 3, 10
3. Units Concept                   6, 7, 9, 13, 14, 15, 16, 22, 23
4. Equivalent Fraction Concept     3, 5, 6, 8, 9, 11, 12, 17, 18, 19, 20, 21

Table 2
DIF Results from the Logistic Regression Approach

Item       Chi-square   Effect size   Chi-square   Effect size   Change in    Change in     Uniform DIF
           (Step 1)     (Step 1)      (Step 2)     (Step 2)      chi-square   effect size   (Yes or No)
Item 1      655.37      0.296          661.08      0.299          5.71        0.003         Yes
Item 2      768.65      0.383          777.65      0.387          9.00        0.004         Yes
Item 3     1176.51      0.535         1176.74      0.535          0.23        0             No
Item 4      132.49      0.13           132.51      0.13           0.02        0             No
Item 5      301.09      0.255          301.54      0.256          0.45        0.001         No
Item 6     1008.24      0.507         1011.38      0.508          3.14        0.001         No
Item 7      796.21      0.352          804.2       0.355          7.99        0.003         Yes
Item 8      518.35      0.245          518.38      0.245          0.03        0             No
Item 9     1657.87      0.653         1678.59      0.659         20.72        0.006         Yes
Item 10    1012.11      0.46          1012.3       0.46           0.19        0             No
Item 11    1271.96      0.533         1285.43      0.537         13.47        0.004         Yes
Item 12    1288.64      0.519         1291.43      0.52           2.79        0.001         No
Item 13      82.63      0.045           83.59      0.046          0.96        0.001         No
Item 14     859.29      0.375          877.43      0.381         18.14        0.006         Yes
Item 15     307.82      0.15           307.94      0.15           0.12        0             No
Item 16     275.96      0.15           280.56      0.152          4.6         0.002         Yes
Item 17     831.06      0.397          831.56      0.397          0.5         0             No
Item 18     598.16      0.363          604.66      0.367          6.5         0.004         Yes
Item 19    1185.88      0.55          1215.45      0.56          29.57        0.010         Yes
Item 20     300.59      0.148          306.37      0.151          5.78        0.003         Yes
Item 21     798.3       0.358          802.66      0.36           4.36        0.002         Yes
Item 22     386.03      0.183          386.32      0.183          0.29        0             No
Item 23     420.21      0.257          420.8       0.258          0.59        0.001         No

Table 3
DIF Results from SIBTEST Analyses

Item       β_UNI     Standard Error   p-value   DIF
Item 1      0.048    0.018            0.009     Yes
Item 2      0.047    0.015            0.002     Yes
Item 3      0.012    0.014            0.405     No
Item 4      No usable score cells
Item 5      No usable score cells
Item 6      No usable score cells
Item 7      0.051    0.018            0.005     Yes
Item 8      0.010    0.018            0.574     No
Item 9     -0.064    0.014            0.000     Yes
Item 10     0.010    0.015            0.514     No
Item 11    -0.052    0.015            0.001     Yes
Item 12    -0.024    0.017            0.149     No
Item 13     0.005    0.017            0.786     No
Item 14     0.083    0.018            0.000     Yes
Item 15    -0.003    0.019            0.887     No
Item 16    -0.033    0.017            0.050     Yes
Item 17    -0.013    0.015            0.394     No
Item 18     0.030    0.012            0.015     Yes
Item 19    -0.074    0.013            0.000     Yes
Item 20     0.053    0.019            0.005     Yes
Item 21    -0.031    0.018            0.076     No
Item 22    -0.002    0.019            0.918     No
Item 23    -0.007    0.013            0.591     No

Table 4
DBF Results for the Four Major Content Areas

Major Concept of Fraction        Items                                       β        p-value
1. Basic Fraction Concept        4, 5, 7, 13, 15                             0.051    .177
2. Equal Sharing Concept         1, 2, 3, 10                                 0.126    .001
3. Units Concept                 6, 7, 9, 13, 14, 15, 16, 22, 23             0.003    .965
4. Equivalent Fraction Concept   3, 5, 6, 8, 9, 11, 12, 17, 18, 19, 20, 21  -0.36     .000

Table 5
DBF Results from SIBTEST Analyses for Cognitive Components

Cognitive Component             Items                         p-value   DBF
1. Using Illustrations          1-6, 8-11, 13-15, 19, 21      .225      No
2. Written Explanations         8, 13, 15, 18, 20-23          .661      No
3. Judgment Application         12-14, 16, 23                 .327      No
4. Computation                  4-6, 12, 13, 15, 17           .098      No
5. Checking Options             1, 7, 12, 17                  .074      No
6. Spatial Unfolding            13, 16, 23                    .215      No
7. Solving Routine Problems     4, 5, 18                      .163      No

Substantive Interpretation of Cognitive Components

A1 - Using Illustrations: Q1, Q2, Q3, Q5, Q6, Q8, Q9, Q10, Q11, Q13, Q14, Q15, Q19, and Q21

Fourteen items had illustrations in the stem or as part of the distracters. Regardless of item difficulty, the presence of an illustration changed the fraction item in a significant way. Theoretical support for the illustration component was subsequently found in the Manual of Attribute-Coding for General Mathematics in TIMSS Studies (Tatsuoka, Corter, & Gererro, 1995), attributes P7 and S3, where P7 is defined as "Be able to generate, visualize figures and graphs" and S3 as "Be able to work with figures, tables, charts and graphs." In general, an item became more difficult to solve when an illustration was present. This suggests, speculatively, that the mind must marshal different resources to solve a fraction item that includes an illustration, regardless of how easy or difficult the item may be.

A2 - Providing a Written Interpretation: Q8, Q13, Q15, Q18, Q20, Q21, Q22, and Q23

Eight items required more than simply solving the fraction item or choosing the right multiple-choice answer. In questions Q18 through Q20, with the exception of Q19, the students were required to explain their multiple-choice answer in writing. In Q8 there was no correct answer among the distracters, so the student had to write in the correct response. In Q13 and Q15 the students had to properly interpret the question before trying to solve it. In all cases, again independent of item difficulty, the questions required interpretation or the answer would not be correct. Theoretical support was found in the same TIMSS attribute-coding manual (Tatsuoka, Corter, & Gererro, 1995), attribute S10, defined as "Be able to work with open-ended questions." In general, an item became more difficult to solve when it required an interpretation while being solved, which again suggests, speculatively, that the mind must marshal different resources to solve a fraction item and provide an interpretation at the same time, regardless of how easy or difficult the item may be.

A3 - Judgmental Application: Q12, Q13, Q14, Q23

In these questions a judgment is required for the student to answer the item correctly. There is no option for the student simply to compute and then give the right answer; the student must analyze the problem, restructure it, and then make a judgment about the right answer. The requirement to make a judgment made the item significantly more difficult. Furthermore, this cognitive skill tended to appear only in the more advanced questions.

A4 - Computation: Q4, Q5, Q6, Q12, Q13, Q15, and Q17 (P2)

In these questions the students were required to compute the fraction. Generally speaking, this conceptual fraction test demanded strong reading and conceptual ability, and being strong computationally often would not help; in these particular questions, however, being able to compute the right answer did help. This was one of only two cognitive components that actually made an item easier: when students encountered these questions, they generally found them easier.

A5 - Checking Options: Q1, Q7, Q12, and Q17

This component was found in only four of the items. To solve these problems the students had to check the multiple-choice options; there was no other way to solve them. They could not reason the answer out or reach the right conclusion abstractly, and had to work through the multiple-choice distracters to answer the question. This requirement made the items more difficult.

A6 - Spatial Unfolding: Q13, Q16, and Q23

These three items required the student to mentally unfold the item. In Item 13 the student needed to fold and unfold a ribbon mentally; in Item 16 the student had to mentally remove the three purple buttons from a jar; and in Item 23 the student had to consider the possibility of one quarter being bigger or smaller than one half. This requirement to reflect on and spatially process the item mentally made the question more difficult. Whether the processing involved spatial displacement (mental rotation, reflection, and exchange) or spatial distortion (adding, removing, or shading) made no difference.

A7 - Solving Routine Problems: Q4, Q5, and Q18

This twenty-three-item conceptual fraction test offered students only two skills that made items easier, and this was the second: these questions were routine in nature, and solving them followed known algorithms. When students encountered these questions, the items became easier.

Table 6
A Q-Matrix for 7 Cognitive Components with Presence or Absence in 23 Questions

A1 = Using Illustrations, A2 = Written Interpretations, A3 = Judgmental Application,
A4 = Computation, A5 = Checking Options, A6 = Spatial Unfolding, A7 = Solving Routine Problems

           A1   A2   A3   A4   A5   A6   A7
Item 1      1    0    0    0    1    0    0
Item 2      1    0    0    0    0    0    0
Item 3      1    0    0    0    0    0    0
Item 4      1    0    0    1    0    0    1
Item 5      1    0    0    1    0    0    1
Item 6      1    0    0    1    0    0    0
Item 7      0    0    0    0    1    0    0
Item 8      1    1    0    0    0    0    0
Item 9      1    0    0    0    0    0    0
Item 10     1    0    0    0    0    0    0
Item 11     1    0    0    0    0    0    0
Item 12     0    0    1    1    1    0    0
Item 13     1    1    1    1    0    1    0
Item 14     1    0    1    0    0    0    0
Item 15     1    1    0    1    0    0    0
Item 16     0    0    1    0    0    1    0
Item 17     0    0    0    1    1    0    0
Item 18     0    1    0    0    0    0    1
Item 19     1    0    0    0    0    0    0
Item 20     0    1    0    0    0    0    0
Item 21     1    1    0    0    0    0    0
Item 22     0    1    0    0    0    0    0
Item 23     0    1    1    0    0    1    0
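To make the structure of Table 6 concrete, the short sketch below (an illustration only; the array layout and names are not from the paper, and the values are simply transcribed from Table 6) stores the Q-matrix as a NumPy array and lists the items coded for each cognitive component.

import numpy as np

attributes = ["Using Illustrations", "Written Interpretations", "Judgmental Application",
              "Computation", "Checking Options", "Spatial Unfolding", "Solving Routine Problems"]

# Rows are Items 1-23 and columns are attributes A1-A7, transcribed from Table 6.
Q = np.array([
    [1,0,0,0,1,0,0], [1,0,0,0,0,0,0], [1,0,0,0,0,0,0], [1,0,0,1,0,0,1],
    [1,0,0,1,0,0,1], [1,0,0,1,0,0,0], [0,0,0,0,1,0,0], [1,1,0,0,0,0,0],
    [1,0,0,0,0,0,0], [1,0,0,0,0,0,0], [1,0,0,0,0,0,0], [0,0,1,1,1,0,0],
    [1,1,1,1,0,1,0], [1,0,1,0,0,0,0], [1,1,0,1,0,0,0], [0,0,1,0,0,1,0],
    [0,0,0,1,1,0,0], [0,1,0,0,0,0,1], [1,0,0,0,0,0,0], [0,1,0,0,0,0,0],
    [1,1,0,0,0,0,0], [0,1,0,0,0,0,0], [0,1,1,0,0,1,0],
])

# Print the item bundle implied by each cognitive component (1-based item numbers).
for j, name in enumerate(attributes):
    items = (np.flatnonzero(Q[:, j]) + 1).tolist()
    print(f"A{j + 1} {name}: {items}")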