Having your cake and eating it too: multiple dimensions and a composite. Perman Gochyyev and Mark Wilson, UC Berkeley. BEAR Seminar, October 2018.
Outline: motivating example; different modeling approaches; the composite model; reliability; plausible values; empirical example.
Micro- and macro-level: the individual dimensions (micro), and a summative combination of those multiple dimensions, the composite (macro). Three main modeling options: the uni- and multidimensional pair, the bifactor model, the higher-order model.
Micro- and macro-level: Mathematics Achievement vs. Algebra, Geometry, Statistics. Administrators: What is the mathematics achievement of students? Teachers: Which topic needs closer attention?
Classical Test Theory
Item Response Theory
Bifactor model: a serious limitation for interpretation in this context; not useful for practitioners.
Bifactor model. "Perhaps the methodologists who are promoting this model know some secret unknown to the authors, but we have no conceptualization what such things (Algebra uncorrelated with Mathematics Achievement, Geometry uncorrelated with Mathematics Achievement, and Statistics uncorrelated with Mathematics Achievement) might be, and/or how they could be interpreted." (Wilson & Gochyyev, forthcoming, p. 7)
Second-order (higher-order) model: the lower-order estimates are a linear function of the higher-order estimate. If the relationship is linear, each person has only one estimate (the higher-order one); the lower-order ones are all determined by it.
Composite model. Assumptions: the sub-test level (the parts) is the main focus for measurement; the sum-total level (the whole) is needed for other pragmatic uses. Two parts: (1) a multidimensional model for the sub-tests; (2) a predictive model for a composite of the latent variables based on each sub-test.
Composite model: a hybrid of two measurement traditions. Reflective measurement (the dominant tradition): the latent variable is seen as the source of the responses to the items. Formative measurement: the items are seen as the source of the general variable.
Composite model. Howell, Breivik & Wilcox (2007, p. 205): formative measurement "is not an equally attractive alternative" to reflective measurement, and whenever possible, in developing new measures or choosing among alternative existing measures, researchers should opt for reflective measurement. We agree. The key question: which level of the measurement should be optimized? In the educational context, the level of the sub-tests should be optimized, i.e., reflective measurement at the sub-test level.
Estimation
Weighting schemes. Weighting by the number of items (item-frequency weighting): not ideal; confounded by design-related decisions; implicitly encoded in the unidimensional modeling approach. Reliability weighting: the more reliable the score for a dimension, the higher the weight it gets; also affected by the number of items for that dimension.
Weighting schemes. Weighting by mean item difficulty (item-difficulty weighting): if a dimension's items are more difficult, that dimension should receive a higher weight in the composite. One should use either proportion correct or IRT difficulties obtained from the unidimensional model. If the dimension-specific difficulty means differ substantially, this may hint at possible design flaws: as good practice in instrument design, one should aim to have items from each dimension span the ability continuum.
Weighting schemes. Weighting by intended use (consequential weighting): not all strands are created equal. Depending on the grade level, some topics/content areas dominate the school year compared to others; adjusting the weights accordingly, giving more weight to topics that are covered more, can be useful for one important reason: reflecting in the test the apparent amount of a topic in the curriculum. Particularly relevant in educational achievement testing.
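The three weighting schemes above can be sketched as follows; all numbers (item counts, reliabilities, coverage proportions) are illustrative, and the composite is simply the weighted sum of the dimension-specific person estimates.

```python
import numpy as np

# Hypothetical setup: four dimensions with per-dimension item counts,
# EAP reliabilities, and curriculum coverage (all numbers illustrative).
n_items = np.array([11, 8, 3, 3])                  # item-frequency weighting
reliability = np.array([0.85, 0.80, 0.60, 0.62])   # reliability weighting
coverage = np.array([0.40, 0.30, 0.15, 0.15])      # consequential weighting

def normalize(w):
    """Scale raw weights so they sum to 1."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

# Each scheme yields a weight vector; the composite is the weighted sum
# of the dimension-specific person estimates theta (persons x dimensions).
theta = np.random.default_rng(0).normal(size=(100, 4))
for name, w in [("item-frequency", n_items),
                ("reliability", reliability),
                ("consequential", coverage)]:
    comp = theta @ normalize(w)                    # one composite per person
    print(name, comp.shape)
```

The only design choice here is the normalization: whatever the raw weights are, they are rescaled to sum to one so that composites from different schemes stay on comparable scales.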
Common scale across dimensions: often overlooked, yet it is what justifies the combination of these dimensions into a single summary score (the composite score). Option 1: construct composite scores after aligning the different dimensions. Option 2: implement this alignment within the estimation routine itself, so the dimensions are forced onto a common metric.
Reliability of the composite
EAP reliability. EAP: mean of the posterior distribution; the variance of the posterior represents the remaining uncertainty. Mislevy, Beaton, Kaplan & Sheehan (1992): reliability can be viewed as the amount by which the measurement process has reduced uncertainty in the prediction of each individual's ability: R_EAP = 1 − σ̄²_post / σ²_p = var(θ̂_EAP) / σ²_p, where σ̄²_post is the mean posterior variance and σ²_p is the population (prior) variance.
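A minimal sketch of the two equivalent forms of this formula, with simulated posterior summaries (the posterior variance of 0.2 and prior variance of 1.0 are illustrative): by the law of total variance, the prior variance splits into the variance of the EAPs plus the mean posterior variance, so both ratios agree.

```python
import numpy as np

# Illustrative numbers: posterior variances for each person and the
# population (prior) variance sigma2_p; EAPs carry the remaining variance.
rng = np.random.default_rng(1)
sigma2_p = 1.0
post_var = np.full(500, 0.2)                       # uncertainty left after measurement
eap = rng.normal(0.0, np.sqrt(sigma2_p - 0.2), 500)

# Two equivalent forms under the model:
r_uncertainty = 1 - post_var.mean() / sigma2_p     # reduction in uncertainty
r_variance = eap.var() / sigma2_p                  # variance of EAPs over prior variance
print(round(r_uncertainty, 2))                     # -> 0.8
```

With real data the two forms match only approximately, since the sample variance of the EAPs carries sampling noise.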
Variance and reliability for the composite. To construct a model-based variance estimate for the composite, we use plausible values (PVs; Mislevy et al., 1992): (1) randomly generate 5 PVs for each person and each dimension; (2) obtain the composite score resulting from each draw (using the weights); (3) estimate the variance of each of the 5 composite distributions; (4) average the variance across the five draws. To obtain the EAP reliability, divide the observed variance of the composite (obtained from the dimension-specific EAP scores) by the variance obtained from the above steps.
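The four steps can be sketched with simulated plausible values; the dimension count, posterior variance, and equal weights are illustrative stand-ins for quantities that would come from the fitted multidimensional model, and the EAP composite is approximated here by averaging the PV draws.

```python
import numpy as np

# Sketch of the four steps with simulated data (illustrative numbers;
# real PVs are drawn from the fitted multidimensional model's posteriors).
rng = np.random.default_rng(2)
n_persons, n_dims, n_pv = 1000, 4, 5
weights = np.full(n_dims, 1 / n_dims)              # equal weights here

# (1) five PV draws per person and dimension: truth plus posterior noise
theta = rng.normal(size=(n_persons, n_dims))
pv = theta + rng.normal(scale=np.sqrt(0.2), size=(n_pv, n_persons, n_dims))

# (2) composite score for each draw (weighted sum over dimensions)
composite_pv = pv @ weights                        # shape (n_pv, n_persons)

# (3) variance of each of the five composite distributions,
# (4) averaged across the draws
pv_variance = composite_pv.var(axis=1).mean()

# EAP reliability: observed variance of the EAP-based composite
# (approximated here by the mean of the PV draws) over the PV variance
eap_composite = composite_pv.mean(axis=0)
r_eap = eap_composite.var() / pv_variance
```

Because averaging the draws removes between-draw noise, the EAP-based variance is never larger than the PV-based variance, so r_eap stays at or below 1.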
Alternative reliability for the composite. Reliability coefficient (Spearman, 1910): the correlation between one half and the other half of several measures of the same thing; the classical formulation of reliability is the correlation between two random measurements of the composite. Using PVs as above, obtain the correlations between each pair of the 5 composite distributions and calculate the mean of the 10 possible pairings (i.e., 5!/(3!·2!) = 10).
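This pairwise-correlation version can be sketched as follows; the composite draws are simulated (true composite plus draw-specific noise, both illustrative), standing in for the five PV-based composite distributions from the model.

```python
import numpy as np
from itertools import combinations

# Illustrative stand-in for the five PV-based composite distributions:
# a true composite score plus independent draw-specific noise.
rng = np.random.default_rng(3)
theta = rng.normal(size=1000)                              # true composite
composite_pv = theta + rng.normal(scale=0.5, size=(5, 1000))

# Correlation for each of the 10 pairs of draws, then their mean.
pair_r = [np.corrcoef(composite_pv[i], composite_pv[j])[0, 1]
          for i, j in combinations(range(5), 2)]
r_composite = float(np.mean(pair_r))                       # ~ 1/(1 + 0.25) = 0.8
```

Each pair of draws behaves like two random measurements of the same composite, so their correlation estimates the classical reliability; averaging the 10 pairings stabilizes the estimate.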
Example: ADM data. A modeling curriculum designed to improve middle school students' statistical reasoning; schools were randomly assigned to treatment/control, with pre- and post-tests; we used data from the post-test. Five sub-dimensions (domains): Data Display (DAD), Models of Variability (MOV), Chance (CHA), Concepts of Statistics (COS), Informal Inference (INI). Due to the very high correlation between the DAD and INI dimensions, we combined these two. 25 items: DAD (11), COS (8), CHA (3), MOV (3).
Example: multidimensional Rasch model. Unidimensional Rasch, for comparison: variance 0.411 (0.024); EAP reliability 0.89; Cronbach's alpha 0.87.
Example: multidimensional Rasch model
Example: naïve correlations. Overestimated due to the correlated priors used when computing the EAP estimates: the EAP estimates are shrunken towards each other, and the amount of shrinkage depends (inversely) on their reliabilities.
Example: bifactor model. The latent-variable correlation between the common factor and the unidimensional latent variable is estimated at 0.855 (calculated using plausible values for the unidimensional latent variable, and using the reliability of the common factor to correct for the overestimation of the EAP correlations).
Example: bifactor model, naïve correlations.
Example: second-order model. The latent-variable correlation between the common factor and the unidimensional latent variable is estimated at 0.856 (calculated using plausible values for the unidimensional latent variable, and using the reliability of the overall factor to correct for the overestimation).
Example: second-order model. Correlations between latent variables vs. naïve correlations (between EAP estimates).
Example: Composite model with equal weights The latent variable correlation between the composite and the unidimensional latent variable: 0.84
Example: Composite model with reliability weights The latent variable correlation between the composite and the unidimensional latent variable: 0.85
Conclusion. Inherently multidimensional contexts (the parts) nevertheless also include a certain level of interest in the overarching combination of those multiple dimensions (the whole). Using the uni- and multidimensional pair of modeling techniques can give both perspectives; to bring them together under a single analytic umbrella, the composite model offers some very useful advantages. We see it as readily useful quite broadly, addressing a very long-standing measurement problem.
Thank you. Questions? perman@berkeley.edu, markw@berkeley.edu