Having your cake and eating it too: multiple dimensions and a composite


Perman Gochyyev and Mark Wilson, UC Berkeley BEAR Seminar, October 2018

Outline
- Motivating example
- Different modeling approaches
- Composite model
- Reliability
- Plausible values
- Empirical example

Micro- and macro-level
- individual dimensions
- summative combination of those multiple dimensions: the composite
- three main modeling options: the uni- and multidimensional pair, the bifactor model, the higher-order model

Micro- and macro-level
- Mathematics Achievement: Algebra, Geometry, Statistics
- Administrators: What is the mathematics achievement of students?
- Teachers: Which topic needs closer attention?

Classical Test Theory

Item Response Theory

Bifactor model
- a serious limitation for interpretation in this context
- not useful for practitioners

Bifactor model
"Perhaps the methodologists who are promoting this model know some secret unknown to the authors, but we have no conceptualization of what such things ('Algebra uncorrelated with Mathematics Achievement', 'Geometry uncorrelated with Mathematics Achievement', and 'Statistics uncorrelated with Mathematics Achievement') might be, and/or how they could be interpreted." (Wilson & Gochyyev, forthcoming, p. 7)

Second-order (higher-order) model
- the lower-order estimates are a linear function of the higher-order estimate
- if the relationship is linear, each person has only one estimate (the higher-order one); the lower-order ones are all determined by that

Composite model
Assumptions:
- the sub-test level (the "parts") is the main focus for measurement
- the sum-total level (the "whole") is needed for other pragmatic uses
Two parts:
1. a multidimensional model for the sub-tests
2. a predictive model for a composite of the latent variables based on each sub-test

Composite model: a hybrid of two measurement traditions
- reflective measurement (the dominant trend): the latent variable is seen as being the source of the responses to the items
- formative measurement: the items are seen as being the source of the general variable

Composite model
Howell, Breivik & Wilcox (2007, p. 205): formative measurement "is not an equally attractive alternative to reflective measurement" and, whenever possible, "in developing new measures or choosing among alternative existing measures, researchers should opt for reflective measurement."
- we agree
- the key question: which level of the measurement should be optimized?
- in the educational context: the level of the sub-tests should be optimized
- hence: reflective measurement at the sub-test level

Estimation

Weighting Schemes
- weighting by the number of items ("item-frequency weighting"): not ideal; confounded by design-related decisions; implicitly encoded in the unidimensional modeling approach
- reliability weighting: the more reliable the score for a dimension, the higher the weight it gets; also affected by the number of items for that dimension

Weighting Schemes
- weighting by mean item difficulty ("item-difficulty weighting"): if a dimension's items are more difficult, that dimension should have a higher weight in the composite
- one should use either a proportion correct or IRT difficulties obtained from the unidimensional model
- if dimension-specific difficulty means differ substantially, this may hint at possible design flaws
- as good practice in instrument design, one should aim to have items from each dimension span the ability continuum

Weighting Schemes
- weighting by intended use ("consequential weighting"): not all strands are created equal
- depending on the grade level, some topics/content areas dominate the school year compared to others
- adjusting the weights by giving more weight to topics that are covered more might be useful for one important reason: reflecting in the test the apparent amount of a topic in the curriculum
- particularly relevant in educational achievement testing
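All three weighting schemes above reduce to choosing a vector of non-negative weights that sum to one. A minimal sketch in Python; all the per-dimension numbers (item counts, reliabilities, mean difficulties) are invented for illustration:

```python
def normalize(raw):
    """Scale raw weights so they sum to 1."""
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical values for a three-dimension test (e.g., Algebra, Geometry, Statistics).
n_items         = [11, 8, 6]          # items per dimension   -> item-frequency weights
reliabilities   = [0.85, 0.78, 0.70]  # per-dimension EAP reliabilities -> reliability weights
mean_difficulty = [0.2, 0.5, 0.9]     # mean IRT item difficulty -> item-difficulty weights

w_freq = normalize(n_items)
w_rel  = normalize(reliabilities)
w_diff = normalize(mean_difficulty)

print(w_freq, w_rel, w_diff)
```

Consequential weighting would work the same way, with the raw weights set by curriculum coverage rather than by any statistic of the data.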

Common scale across dimensions
- often overlooked, however insensible that sounds
- a common scale is what justifies combining these dimensions into a single summary score (the composite score)
- option 1: construct the composite scores after aligning the different dimensions
- option 2: implement this alignment within the estimation routine itself, so that the dimensions are forced onto a common metric
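Option 1 (aligning the dimensions before combining) can be sketched as z-standardizing each dimension's scores so that the weighted sum is taken on a common metric. The scores and weights below are hypothetical:

```python
from statistics import mean, pstdev

def standardize(scores):
    """Put one dimension's scores on a common (z-score) metric."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def composite(dim_scores, weights):
    """Weighted sum of aligned (standardized) dimension scores, per person."""
    aligned = [standardize(d) for d in dim_scores]
    n = len(aligned[0])
    return [sum(w * dim[i] for w, dim in zip(weights, aligned)) for i in range(n)]

# Hypothetical EAP scores for 4 persons on 2 dimensions, equal weights.
algebra  = [1.2, -0.3, 0.8, -1.7]
geometry = [5.0,  3.0, 4.0,  2.0]   # deliberately on a different metric
print(composite([algebra, geometry], [0.5, 0.5]))
```

Without the standardization step, the geometry dimension's larger scale would dominate the composite regardless of the nominal weights.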

Reliability of the composite

EAP reliability
- EAP: the mean of the posterior distribution
- the variance of the posterior is used to represent uncertainty
- Mislevy, Beaton, Kaplan & Sheehan (1992): reliability can be viewed as the amount by which the measurement process has reduced uncertainty in the prediction of each individual's ability
- R_EAP = 1 − σ̄²_p / σ²_θ = σ²_EAP(θ) / σ²_θ, where σ̄²_p is the mean posterior variance and σ²_θ is the total latent variance
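The formula above can be computed directly from the per-person EAP estimates and posterior variances, using the decomposition σ²_θ = var(EAP) + mean posterior variance. A small sketch with made-up numbers:

```python
from statistics import mean, pvariance

def eap_reliability(eap_scores, posterior_vars):
    """R_EAP = 1 - (mean posterior variance) / (total latent variance),
    where the total variance is var(EAP scores) + mean posterior variance."""
    post = mean(posterior_vars)
    total = pvariance(eap_scores) + post
    return 1.0 - post / total

# Hypothetical EAP point estimates and posterior variances for 5 persons.
eaps = [-1.0, -0.4, 0.1, 0.6, 1.2]
pvs  = [0.06, 0.05, 0.05, 0.06, 0.07]
print(round(eap_reliability(eaps, pvs), 3))
```

The closer the posteriors' variances are to zero relative to the spread of the estimates, the closer R_EAP gets to 1.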

Variance and reliability for the composite
To construct this model-based variance estimate for the composite, we use plausible values (PVs; Mislevy et al., 1992):
(1) randomly generate 5 PVs for each person and for each dimension
(2) obtain the composite score resulting from each draw (using the weights)
(3) estimate the variance of each of the 5 composite distributions
(4) average the variances across the five draws
To obtain the EAP reliability, divide the observed variance of the composite (obtained from the dimension-specific EAP scores) by the variance obtained from the above steps.
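The four steps above can be sketched as follows. This is a toy illustration, not the authors' estimation code: the posteriors are approximated as normal around hypothetical EAPs, and the data are invented.

```python
import random
from statistics import mean, pvariance

random.seed(0)

def composite_variance_from_pvs(eaps, post_sds, weights, n_draws=5):
    """Steps (1)-(4): draw one PV per person per dimension from each posterior,
    form the weighted composite for each draw, and average the composite
    variances across draws."""
    n_persons = len(eaps[0])
    variances = []
    for _ in range(n_draws):
        # (1) one plausible value per person per dimension
        pvs = [[random.gauss(eaps[d][p], post_sds[d][p]) for p in range(n_persons)]
               for d in range(len(eaps))]
        # (2) composite score for this draw
        comp = [sum(w * pvs[d][p] for d, w in enumerate(weights))
                for p in range(n_persons)]
        # (3) variance of this composite distribution
        variances.append(pvariance(comp))
    # (4) average across draws
    return mean(variances)

# Hypothetical EAPs and posterior SDs for 2 dimensions, 6 persons, equal weights.
eaps = [[-1.0, -0.4, 0.0, 0.3, 0.7, 1.1],
        [-0.9, -0.3, 0.1, 0.4, 0.6, 1.0]]
sds  = [[0.25] * 6, [0.30] * 6]
pv_var = composite_variance_from_pvs(eaps, sds, [0.5, 0.5])

# EAP reliability of the composite: observed EAP-composite variance / PV variance
comp_eap = [0.5 * a + 0.5 * b for a, b in zip(eaps[0], eaps[1])]
print(pvariance(comp_eap) / pv_var)
```

Because the PVs carry the posterior uncertainty that the shrunken EAP scores lack, the PV-based variance is typically larger, so the ratio behaves like a reliability coefficient.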

Alternative reliability for the composite
- Reliability Coefficient (Spearman, 1910): the correlation between one half and the other half of several measures of the same thing
- classical formulation of reliability: the correlation between two random measurements of the composite
- using the PVs as above, obtain the correlation between each pair of the 5 composite distributions and calculate the mean of the 10 possible pairings (5!/(3!·2!) = 10)
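A sketch of this pairwise-correlation estimate, again with invented data: five composite PV distributions are simulated as a true composite plus independent draw noise, and the reliability is the mean Pearson correlation over the C(5,2) = 10 pairs.

```python
import random
from statistics import mean
from itertools import combinations

random.seed(1)

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def pv_pair_reliability(composite_draws):
    """Mean correlation over all pairs of composite PV distributions:
    the classical 'two random measurements of the same thing' formulation."""
    return mean(pearson(a, b) for a, b in combinations(composite_draws, 2))

# Hypothetical: 5 composite PV distributions for 6 persons.
true = [-1.2, -0.5, 0.0, 0.3, 0.8, 1.4]
draws = [[t + random.gauss(0, 0.3) for t in true] for _ in range(5)]
print(len(list(combinations(draws, 2))))  # 10 pairings, as 5!/(3!2!) = 10
print(round(pv_pair_reliability(draws), 2))
```

The smaller the draw noise relative to the spread of the true composite, the closer the mean pairwise correlation is to 1.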

Example: ADM Data
- Modeling curriculum designed to improve middle school students' statistical reasoning
- schools were randomly assigned to treatment/control; pre- and post-test; we used data from the post-test
- five sub-dimensions (domains): Data Display (DAD), Models of Variability (MOV), Chance (CHA), Concepts of Statistics (COS), Informal Inference (INI)
- due to the very high correlation between the DAD and INI dimensions, we combined these two dimensions
- 25 items: DAD (11), COS (8), CHA (3), MOV (3)

Example: multidimensional Rasch model
- for comparison, unidimensional Rasch: variance 0.411 (0.024); EAP reliability 0.89; Cronbach's alpha 0.87

Example: multidimensional Rasch model

Example: naïve correlations
- overestimated due to the correlated bivariate priors used when computing EAP estimates
- EAP estimates are shrunken towards each other, and the amount of shrinkage depends (inversely) on their reliabilities

Example: Bifactor model
- the latent variable correlation between the common factor and the unidimensional latent variable is estimated at 0.855
- calculated using plausible values for the unidimensional latent variable, and using the reliability of the common factor to correct for the overestimation of the EAP correlations

Example: Bifactor model naïve correlations

Example: Second-order model
- the latent variable correlation between the common factor and the unidimensional latent variable is estimated at 0.856
- (calculated using plausible values for the unidimensional latent variable, and using the reliability of the overall factor to correct for the overestimation)

Example: Second-order model Correlations between latent variables Naïve correlations (between EAP estimates)

Example: Composite model with equal weights The latent variable correlation between the composite and the unidimensional latent variable: 0.84

Example: Composite model with reliability weights The latent variable correlation between the composite and the unidimensional latent variable: 0.85

Conclusion
- inherently multidimensional contexts ("the parts") nevertheless often also include a certain level of interest in the overarching combination of those multiple dimensions ("the whole")
- using the uni- and multidimensional pair of modeling techniques can give both perspectives
- to bring them together under a single analytic umbrella, the composite model offers some very useful advantages
- we see it as being readily useful quite broadly, addressing a very long-standing measurement problem

Thank you! Questions?
perman@berkeley.edu
markw@berkeley.edu