Models in Educational Measurement Jan-Eric Gustafsson Department of Education and Special Education University of Gothenburg
Background Measurement in education and psychology has increasingly come to rely on explicitly formulated statistical models. Statistical models offer power, precision and flexibility by focusing on a few essential aspects of the phenomenon under study. But if the model assumptions do not agree with the phenomenon, inferences may be incorrect. Therefore, issues of model fit are centrally important. However, it is not always easy to determine whether a model fits the data, or what the consequences of model misfit are.
The Rasch model According to the Rasch model, the probability of a correct response to a test item is a function of the ability of the person and of the difficulty of the item: as the ability of the person increases, the probability of a correct answer increases; as the difficulty of the item increases, the probability of a correct answer decreases. Under certain assumptions, it is possible to estimate the difficulty of items independently of the ability of persons, and to estimate the ability of persons independently of the difficulty of the items.
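The item response function just described is usually written as P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)), where theta is the person's ability and b the item's difficulty. A minimal sketch in Python (the function name and the example values are illustrative, not from the original slides):

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model:
    P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Higher ability -> higher probability of success (difficulty fixed at 0)
probs_by_ability = [rasch_prob(theta, 0.0) for theta in (-2.0, 0.0, 2.0)]

# Higher difficulty -> lower probability of success (ability fixed at 0)
probs_by_difficulty = [rasch_prob(0.0, b) for b in (-2.0, 0.0, 2.0)]
```

When ability equals difficulty the probability is exactly 0.5, which is why the difficulty parameter is on the same scale as ability.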
The Rasch model, cont Once the difficulty parameters of a set of items have been estimated, they can be used to construct different tests which all measure the same ability. Different test-takers can be given different items to measure one and the same ability. This offers great advantages for solving practical measurement problems, such as adaptive testing, horizontal and vertical linking of tests, and constructing matrix-sampling test designs.
Does the model fit data? The Rasch model is attractive because, if it fits the data, the properties of the model guarantee simple and powerful solutions to practical measurement problems. Assumptions: unidimensionality, and homogeneous discrimination of all items. These assumptions may be tested with statistical tests constructed within the framework of the Rasch model.
Gustafsson, J. E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33(2), 205-233. Two categories of statistical tests: ICC-tests investigate whether item parameters are invariant across subsets of persons (e.g., high scorers vs. low scorers; boys vs. girls); PCC-tests investigate whether ability parameters are invariant across subsets of items. Some results: ICC-tests do not detect multidimensionality. Muthén (1978) analyzed 15 locus of control items with a new method for factor analysis of dichotomous variables and found three weakly correlated factors. However, according to an ICC-test the Rasch model had good fit to these data. The PCC-test supported the conclusion that there were three separate dimensions. The statistical power of the ICC-tests is strongly dependent on sample size and on the heterogeneity of the sample. The Rasch model does not fit speeded tests or tests which allow guessing.
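The invariance idea behind ICC-tests can be illustrated with a small simulation (this is a generic sketch, not the specific tests from the 1980 paper). Under the Rasch model, among persons who answer exactly one of two items correctly, the log odds of which item was solved estimates the difficulty difference, and this holds in any ability subgroup; all names and sample sizes below are illustrative:

```python
import math
import random

random.seed(1)

def simulate_responses(abilities, difficulties):
    """Simulate dichotomous Rasch responses, one row per person."""
    return [[1 if random.random() < 1 / (1 + math.exp(-(theta - b))) else 0
             for b in difficulties]
            for theta in abilities]

def pairwise_diff(data, i, j):
    """Estimate b_j - b_i from persons who answer exactly one of items i, j
    correctly; under the Rasch model this is invariant across ability groups."""
    n_i = sum(1 for row in data if row[i] == 1 and row[j] == 0)
    n_j = sum(1 for row in data if row[i] == 0 and row[j] == 1)
    return math.log(n_i / n_j)

difficulties = [0.0, 1.0]  # item 2 is one logit harder than item 1
low_group = [random.gauss(-1.0, 1.0) for _ in range(20000)]   # low-ability persons
high_group = [random.gauss(1.0, 1.0) for _ in range(20000)]   # high-ability persons

est_low = pairwise_diff(simulate_responses(low_group, difficulties), 0, 1)
est_high = pairwise_diff(simulate_responses(high_group, difficulties), 0, 1)
# Both estimates recover the true difficulty difference of 1.0
```

An ICC-test compares such subgroup estimates: a significant discrepancy between them signals misfit, e.g. unequal discrimination.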
Gustafsson, J. E., & Lindblad, T. (1978). The Rasch model for dichotomous items: A solution of the conditional estimation problem for long tests and some thoughts about item screening procedures. Paper presented at the European Conference on Psychometrics and Mathematical Psychology, Uppsala, June 15-17, 1978. The Rasch model was used to analyze a test of English grammar for Swedish students. The model had poor fit, primarily because a set of items measuring knowledge of irregular verbs had too high discrimination. In separate analyses, good fit was found for the irregular verb items, as well as for the other items, after some poorly constructed items had been excluded.
What to do when model fit is poor? (1) Exclude the offending items. This would have caused unacceptable construct underrepresentation. It would also have been illogical, because too high or too low discrimination typically is not an intrinsic characteristic of an item, but depends on whether the other items have similar discrimination or not. (2) Put the problematic items in a separate scale. This would have been impractical, unless we aimed to differentiate between different domains of English grammar; but to do that reliably we would need more items testing irregular verbs. (3) Turn to another, less restrictive model, such as Verhelst's OPLM, which is a Rasch model that allows different but fixed discrimination parameters. This model had not yet been developed at the time. (4) Keep the items in the test and accept the poor fit. This could imply loss of credibility.
Robustness George Box: "Essentially, all models are wrong, but some are useful." Many applications of the Rasch model and other IRT models, such as those in TIMSS and PISA, define both an overall score and subscores for different domains or processes. This must be a violation of the unidimensionality assumption; still, the practice seems meaningful and useful. Coefficient α is often described as not being based on any strict assumptions, but the formula is in fact based on the same assumptions as the Rasch model: unidimensionality and homogeneous item discrimination. If the assumptions are violated, α underestimates reliability. However, even in the presence of large variation in the item discriminations, the underestimation is marginal (e.g., Reuterberg & Gustafsson, 1992).
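The robustness claim about α can be checked analytically. Under a congeneric model x_i = a_i * theta + e_i with Var(theta) = 1 and unit error variances (the loadings below are illustrative, not taken from Reuterberg & Gustafsson), the population value of α can be compared with the true reliability of the sum score:

```python
def population_alpha_and_reliability(loadings, error_var=1.0):
    """Congeneric model: x_i = a_i * theta + e_i, Var(theta) = 1.
    Returns (population coefficient alpha, true reliability of the sum)."""
    k = len(loadings)
    sum_a = sum(loadings)
    item_vars = [a * a + error_var for a in loadings]  # Var(x_i)
    total_var = sum_a ** 2 + k * error_var             # Var of the sum score
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    reliability = sum_a ** 2 / total_var               # true-score variance share
    return alpha, reliability

# Loadings (discriminations) varying by a factor of four
alpha, rel = population_alpha_and_reliability([0.4, 0.8, 1.2, 1.6])
# alpha ~ 0.747 versus true reliability 0.800: only a modest underestimate
```

Even with this fourfold spread in discrimination, α is about 0.05 below the true reliability, consistent with the slide's point that the underestimation is marginal.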
Some conclusions It is difficult to assess the fit of the Rasch model. It is even more difficult to develop well-fitting models. There is a risk of conflict between the model requirements and the validity of the test. Use of the model needs to rely on trust in robustness. The capability of the Rasch model to deal with issues of multidimensionality is limited.
Dimensionality of cognitive abilities Factor analysis was invented to investigate the dimensionality of variables: Spearman invented factor analysis to test the hypothesis that individual differences in cognition can be captured by a g-factor. Thurstone invented exploratory factor analysis and demonstrated that there are seven primary mental abilities; followers of Thurstone extended this number to at least 100 primary abilities. Cattell applied factor analysis to correlations among factors to identify second- and third-order factors, and introduced the distinction between Fluid Intelligence (Gf) and Crystallized Intelligence (Gc). Jöreskog developed confirmatory factor analysis and structural equation modeling, allowing flexible and powerful building and testing of latent variable models.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8(3), 179-203.
Does the model hold? The g = Gf relation was replicated in several studies, but far from all. Carroll's (1993) meta-analysis did not replicate the perfect relation, but it showed that Gf was the broad ability most highly related to g.
Valentin Kvist, A., & Gustafsson, J. E. (2008). The relation between fluid intelligence and the general factor as a function of cultural background: A test of Cattell's investment theory. Intelligence, 36(5), 422-436. The correlation between g and Gf was .83 in a heterogeneous group of adults, but correlations of .98-.99 were found within each of the three sub-groups: non-immigrants, European immigrants, and non-European immigrants. Explanation: Gf is a determinant of learning in all domains, along with motivation, effort and opportunity to learn. When everyone has equal opportunity to learn, Gf influences learning and development in all domains, and so it becomes a general factor. If subgroups of persons have had different opportunities to learn certain domains, the generality of Gf breaks down. These results provide general support for Cattell's investment theory.
The Investment theory The Investment theory basically says that Gf is a causal factor in the development of individual differences in learning. If we knew the mechanisms through which Gf influences the development of fundamental skills such as decoding and vocabulary, we would have a better basis for educational interventions.
Methodological aspects of the hierarchical model The standard view of measurement implies that the phenomenon can be described in terms of a set of correlated dimensions, each of which is unidimensional. However, some constructs are broad and encompass a very wide range of phenomena (e.g., g), others encompass wide domains of phenomena (e.g., Gc), while still others are narrow and encompass a more limited range of phenomena (e.g., knowledge of irregular verbs). The constructs differ in referent generality. Imposing the unidimensionality requirement has implied a focus on constructs with narrow referent generality. Typically it has had the consequence that broad constructs have been splintered into narrower and narrower constructs, as happened in the research on cognitive abilities.
Gustafsson, J.-E. (2002). Measurement from a hierarchical point of view. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73-95). London: Lawrence Erlbaum Associates. Three propositions: To measure constructs with high referent generality it is necessary to use heterogeneous measurement devices. A homogeneous test always measures several dimensions. To measure constructs with low referent generality it is also necessary to measure constructs with high generality.
Measurement from a hierarchical point of view, cont The principle of aggregation: aggregation causes the general factor to account for a larger proportion of variance in the sum of scores than it does in each observed measure. Each observed variable is complex, but aggregate scores may be essentially unidimensional. Aggregation over broad domains of performance is a way to approximate unidimensionality, so that robust use of the Rasch model and other IRT models may still be possible.
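The principle of aggregation can be made concrete with a one-general-factor model (the loading and unique-variance values are illustrative): if each measure loads λ on g with unique variance σ², g accounts for λ²/(λ² + σ²) of a single measure's variance but kλ²/(kλ² + σ²) of the variance of a sum of k such measures, which approaches 1 as k grows. A sketch:

```python
def g_share(k, loading=0.5, unique_var=0.75):
    """Proportion of variance due to the general factor in the sum of k
    parallel measures x_i = loading * g + u_i, with Var(g) = 1 and
    Var(u_i) = unique_var."""
    common = (k * loading) ** 2  # the g-parts of the measures cumulate coherently
    unique = k * unique_var      # the unique parts cumulate only additively
    return common / (common + unique)

shares = {k: round(g_share(k), 3) for k in (1, 5, 10, 40)}
# g accounts for 25% of a single measure's variance,
# but for the bulk of the variance of a long aggregate score
```

The unique parts are uncorrelated, so their variance grows linearly with k, while the coherent g-parts grow quadratically; this is why an aggregate over a broad domain can be essentially unidimensional even though each component measure is complex.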
Grading in Sweden In Sweden grades have always been high-stakes, because they have been the primary instrument for eligibility and selection to the next level of the educational system. Teachers have always been trusted to grade their students. Exams were abolished in the 1960s, and standardized testing has traditionally had a comparatively limited role. Up until the mid-1990s the grading system was norm-referenced, but in 1998 a criterion-referenced grading system was introduced.
The norm-referenced grading system The norm-referenced grading system was developed in the 1940s by Frits Wigforss (SOU 1942:11), after it had been observed that the grades used for admission to grammar school (realskola) severely lacked comparability across schools and teachers. The proposed system specified that grades should be normally distributed in the population, with a specified percentage of the students at each step of the grading scale. So-called standard tests were developed to guide the teachers' grading at the class level. With the introduction of the comprehensive school (grundskola) in 1962, a five-step grading scale (1-5) was introduced, without any pass level.
Critique of the norm-referenced grading system The norm-referenced grading system was criticized on many grounds: It inspired competition rather than cooperation. It was unfair to students in different classes ("There are no 5s left"). Because the grade distribution was specified to be Normal(3,1) in the population, the grades could not be used to describe change in levels of knowledge and skills. Along with a curriculum reform in 1994, the norm-referenced grades were abolished, and a criterion-referenced system was introduced.
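The Normal(3,1) specification translates directly into fixed percentages per grade step. Assuming cut points midway between adjacent steps (1.5, 2.5, 3.5, 4.5; an illustrative choice, not a detail given on the slide), the implied distribution over grades 1-5 is roughly 7/24/38/24/7 percent:

```python
import math

def normal_cdf(x, mean=3.0, sd=1.0):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Cut points midway between the grade steps (illustrative assumption)
cuts = [1.5, 2.5, 3.5, 4.5]
edges = [0.0] + [normal_cdf(c) for c in cuts] + [1.0]
shares = [round(100 * (hi - lo)) for lo, hi in zip(edges, edges[1:])]
# shares for grades 1..5: approximately [7, 24, 38, 24, 7]
```

Fixing the population shares in this way is exactly what made the grades uninformative about changes in absolute levels of knowledge: the distribution is the same by construction no matter what the students know.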
The criterion-referenced grading system The system, first put to use in 1998, had a scale with four steps: Pass with Special Distinction (MVG), Pass with Distinction (VG), Pass (G) and Fail (IG). In 2011 a new scale with six steps (A-F, where F = fail) was introduced. According to the original plans, the proportion of failed students was expected to be a few percentage points, but the first results showed the percentage of failed students to be much higher (9 %); it has since increased to 14 %. The grading is guided by verbally formulated knowledge requirements for the different steps of the grading scale.
Knowledge requirements (partial) for grades E, C, and A at the end of year 9 Grade E: Pupils can choose and use basically functional mathematical methods with some adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change with satisfactory results. Grade C: Pupils can choose and use appropriate mathematical methods with relatively good adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change with good results. Grade A: Pupils can choose and use appropriate and effective mathematical methods with good adaptation to the context in order to make calculations and solve routine tasks in arithmetic, algebra, geometry, probability, statistics, and also relationships and change with very good results.
Gustafsson, J.-E., Cliffordson, C., & Erickson, G. (2014). Likvärdig kunskapsbedömning i och av den svenska skolan: problem och möjligheter [Equitable knowledge assessment in and of the Swedish school: problems and possibilities]. Stockholm: SNS Förlag. Substantial grade inflation, particularly in upper secondary school. Considerable variation in grading practices among teachers and schools. Instability in the national tests across years and subjects. These problems, along with several others, seem to be due to the lack of precision in the verbally formulated knowledge requirements for the different steps on the grading scale. Wigforss (1942) concluded that it is not possible to achieve sufficient comparability in grading based on verbally formulated criteria, which was why he developed the norm-referenced grading system.
Olsen, R. V., & Nilsen, T. (in press). Standard setting in PISA and TIMSS. In S. Blömeke & J.-E. Gustafsson (Eds.), Standard Setting in Education - The Nordic Countries in an International Perspective. New York: Springer Publishing. The authors compare and discuss similarities and differences in how PISA and TIMSS set and formulate descriptions of standards or do scale anchoring (International Benchmarks in TIMSS, based on a curriculum model; Proficiency Levels in PISA, based on a competence model). The focus is on the empirical basis for the development of performance-level descriptors. Their interest in standard setting and performance-level descriptions was partially driven by the fact that the Norwegian grading system has problems of comparability in grading.
A generic example of an item map (from Olsen & Nilsen, in press) Decide on the number and location of cut-scores to be used. Develop Performance-Level Descriptors (PLDs) based on descriptions of the clusters of items identified, and on the general description of the construct stated in the framework. This typically requires a large number of items, given that the PLDs should not be formulated in item-specific terms.
TIMSS International Benchmarks (partial) Low (400): Students have some knowledge of whole numbers and decimals, operations, and basic graphs. Intermediate (475): Students can solve problems involving decimals, fractions, proportions, and percentages in a variety of settings. For example, they can determine proportions of a whole in order to construct pie charts and calculate unit prices to solve a problem. High (550): Students can use information from several sources to solve problems involving different types of numbers and operations. Students can relate fractions, decimals, and percents to each other. They can solve problems with fractions, proportions, and percentages. Students show understanding of whole number exponents. They can identify the prime factorization of a given number. Advanced (625): Students can solve a variety of fraction, proportion, and percent problems and justify their conclusions. They can reason with different types of numbers, including whole numbers, negative numbers, fractions, and percentages in abstract and non-routine situations. For example, given two points on a number line representing unspecified fractions, students can identify the point that represents their product.
PLDs and grading Empirically based PLDs could potentially provide a more stable foundation to support classroom-based assessment and criterion-referenced grading than the currently used knowledge requirements, if formulated at an appropriate level of abstraction. Linking national tests to PISA, TIMSS and the other international studies could provide a broader basis for constructing PLDs. Dimensionality?
Individual differences versus development of competence The PLDs could, perhaps, be developed into empirically and theoretically based descriptions of learning trajectories, which could inform curricula, instruction and assessment. The classical measurement models focus on individual differences: the notions of dimensionality, discrimination, reliability and validity are defined in terms of variance and covariance, and their application requires that the population of persons is defined. These models have limited applicability for the study of individual growth. The main aim of education is to support the development of competence, so in educational measurement issues of development should be a central concern. The tensions between differential and developmental psychology illustrate the difficulty of integrating research on individual differences and development. However, progress has lately been made through growth curve modeling and applications of IRT to solve measurement problems. Hopefully we will see more integration of differential and developmental approaches in the future.