Basic concepts and principles of classical test theory


Basic concepts and principles of classical test theory Jan-Eric Gustafsson

What is measurement? The assignment of numbers to aspects of individuals according to some rule. The aspect being measured must be defined in theoretical terms. Measurement should be understood in a broad sense; it also encompasses classification and assessment.

Why should one measure? It increases precision and comparability, and it gives access to a well-developed set of tools for developing measurement instruments, collecting data, summarizing and describing data, analyzing data, and making inferences and generalizations. However: measurement suits certain purposes but not others.

Classical and modern theories of measurement. Theories of measurement to a large extent deal with how to put together components (items or subscales) into scales with known properties. Classical theories of measurement assume simple relations between the components and the dimension to be measured; measures of test properties are typically group-dependent. Modern theories of measurement (IRT) are based on probabilistic models of the relations between item scores and characteristics of persons and items. This allows ability to be estimated from different items for different persons, and item characteristics to be estimated that are invariant over groups of persons.

An example: the IEA Reading Literacy study 1991. In 1991 some 4,500 Swedish students in grade 3 participated in a study of reading literacy, along with samples of students in about 30 other countries (RL 1991). A large number of instruments were used. Reading literacy tests: 15 texts from three categories and 66 multiple-choice items. Student questionnaire: questions about home background, attitudes towards reading, and reading habits. Parental questionnaire: literacy activities in the home, economic and cultural resources, reading habits and attitudes, relations between home and school, education and occupation. Teacher questionnaire: questions about the class, the teaching of reading, resources, and the teacher. School questionnaire: characteristics of the school, resources, school climate, and relations between home and school.

Starting points for the construction of the reading literacy test. Definition: reading literacy is the ability to understand and use those forms of written language which are required by society and/or are of value to the individual. Requirements on the texts: the students should not have met them before; the texts should be possible to use again after 10 years; they should be appropriate for all countries, languages, ethnic and socioeconomic groups, and both genders; they should be usable stand-alone, in such a way that they could provide a meaningful reading experience; they should not be formulated in such general terms that the students would be able to answer the questions without reading the texts; and they should comprise different levels of difficulty.

The reading literacy test. Three types of texts. Narrative prose: continuous texts which aim to tell a story. They typically follow a linear time sequence and are usually intended to entertain or to involve the reader emotionally; they ranged in length from short fables to longer stories. Expository prose: continuous texts which aim to convey factual information or opinion to the reader. Documents: structured presentations of information in the form of graphs, charts, maps, lists, or sets of instructions. The reader can process the information in a nonlinear fashion without reading the whole text, and typically the number of words is limited. Items: in relation to each text, between two and six questions were asked. Altogether there were 66 multiple-choice items and two open-ended questions; the latter were not included in the analysis because of too low inter-rater agreement. Booklets: the 15 texts and the 66 multiple-choice items were distributed over two booklets (A and B).

Test components. This test, like most other tests, thus consists of different types of components: single questions (items), here scored 0 (incorrect choice) or 1 (correct choice); text passages (testlets, parcels, or item bundles), with maximum scores between 2 and 6; and booklets. If test components (items) are independent, there is more flexibility and power in designing tests than if there are dependencies among the components.

A minitest. A minitest has been created from the ten items belonging to two of the narrative texts ("Bird" and "Shark").

Statistical measures for the minitest

Descriptive Statistics (minitest)
N       Minimum   Maximum   Mean     Std. Deviation
5101    .00       10.00     7.2151   2.26547
Valid N (listwise): 5101

Distribution of scores for the minitest

Means and standard deviations of the items

Item Statistics
           Mean   Std. Deviation   N
nbird1r    .57    .496             5101
nbird2r    .78    .417             5101
nbird3r    .51    .500             5101
nbird4r    .83    .375             5101
nbird5r    .94    .239             5101
nshak1r    .88    .323             5101
nshak2r    .83    .378             5101
nshak3r    .39    .488             5101
nshak4r    .70    .457             5101
nshak5r    .79    .407             5101

Correlations among the items

Inter-Item Correlation Matrix
         nbird1r nbird2r nbird3r nbird4r nbird5r nshak1r nshak2r nshak3r nshak4r nshak5r
nbird1r  1.000   .181    .288    .303    .181    .217    .187    .250    .236    .227
nbird2r  .181    1.000   .171    .275    .206    .219    .167    .145    .164    .191
nbird3r  .288    .171    1.000   .292    .186    .205    .184    .219    .216    .214
nbird4r  .303    .275    .292    1.000   .388    .317    .249    .205    .236    .303
nbird5r  .181    .206    .186    .388    1.000   .278    .199    .109    .173    .230
nshak1r  .217    .219    .205    .317    .278    1.000   .341    .189    .245    .313
nshak2r  .187    .167    .184    .249    .199    .341    1.000   .201    .241    .221
nshak3r  .250    .145    .219    .205    .109    .189    .201    1.000   .253    .239
nshak4r  .236    .164    .216    .236    .173    .245    .241    .253    1.000   .309
nshak5r  .227    .191    .214    .303    .230    .313    .221    .239    .309    1.000

Relations between items and the total score

Item-Total Statistics
           Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
           Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
nbird1r    6.65            4.054               .417                .184               .714
nbird2r    6.44            4.388               .327                .120               .727
nbird3r    6.70            4.087               .394                .165               .719
nbird4r    6.38            4.218               .502                .296               .702
nbird5r    6.28            4.691               .372                .195               .725
nshak1r    6.33            4.417               .450                .239               .712
nshak2r    6.39            4.383               .384                .176               .719
nshak3r    6.83            4.167               .366                .146               .723
nshak4r    6.51            4.154               .413                .183               .714
nshak5r    6.42            4.226               .443                .214               .710

Reliability: the precision of an instrument. How well does an instrument resist the influence of random variation? Does the instrument give the same result upon repeated measurements?

Definition of reliability. Observed score = True score + Error. An instrument's correlation with itself; equivalently, the ratio between true-score variance and observed-score variance (observed-score variance = true-score variance + error variance). But what is a true score and what is error?

Sources of variance in test scores (after Thorndike, 1951) Individual characteristics External/situational factors

Reasons for reliability loss: factors at test administration, rating of responses, guessing, selection of items for the test, and variation in individuals' true scores.

Ways to determine reliability. To determine reliability we would like to be able to compute the correlation between the test and itself, or to know the true scores and the error scores. Since this is not possible, different approaches have been devised. Test-retest: administer the same test twice (memory effects may be a problem; sensitive to temporal instability, but not to effects of item selection). Parallel test: create an identical twin of the test (sensitive to effects of item selection; may or may not be sensitive to temporal instability). Split-half: create two parallel tests by randomly splitting the items into two groups (sensitive to effects of item selection; not sensitive to temporal instability). The split-half correlation gives the reliability of a half test; to get the reliability of the full test it must be corrected with the Spearman-Brown prophecy formula.

Ways to determine reliability, cont. Cronbach's α: a measure of internal consistency among items. Sensitive to effects of item selection; not sensitive to temporal instability. It equals the mean of all possible split-half coefficients, and it increases both as a function of the correlations among the items and as a function of the number of items. With k items,

α = (k/(k−1)) · [1 − Σ var(item_i) / var(total score)]
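As a minimal sketch of the formula above (not part of the lecture), α can be computed directly from a persons-by-items score matrix; the demo data here are invented for illustration, not the IEA data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = (k/(k-1)) * (1 - sum of item variances / variance of total score)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sample variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented demo data (6 persons, 3 dichotomous items):
demo = np.array([[1, 1, 1],
                 [1, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])
print(round(cronbach_alpha(demo), 3))  # → 0.364
```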

Computation of Cronbach s α with SPSS RELIABILITY /VARIABLES=nbird1r nbird2r nbird3r nbird4r nbird5r nshak1r nshak2r nshak3r nshak4r nshak5r /SCALE('ALL VARIABLES') ALL /MODEL=ALPHA.

Reliability and test length. The reliability increases as a function of test length according to the Spearman-Brown prophecy formula. If we lengthen our minitest by a factor of 6.5, to 65 items, we expect a reliability of r(6.5) = 6.5 × .737 / (1 + 5.5 × .737) = .948. If we compute Cronbach's α from the 65 items we obtain:
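The prophecy calculation above can be checked with a short function (a sketch, not from the lecture; .737 is the minitest α quoted above):

```python
def spearman_brown(rho: float, n: float) -> float:
    """Predicted reliability when test length is changed by a factor n:
    rho_n = n * rho / (1 + (n - 1) * rho)."""
    return n * rho / (1 + (n - 1) * rho)

# Lengthening the 10-item minitest (alpha = .737) by a factor of 6.5 to 65 items:
print(round(spearman_brown(0.737, 6.5), 3))  # → 0.948
```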

Reliability as a function of test length

Split-half reliability (first 33 items versus last 32 items). This analysis yields a much lower reliability estimate than Cronbach's α did!

Reliability Statistics
Cronbach's Alpha                 Part 1   Value          .888
                                          N of Items     33
                                 Part 2   Value          .913
                                          N of Items     32
                                 Total N of Items        65
Correlation Between Forms                                .808
Spearman-Brown Coefficient       Equal Length            .894
                                 Unequal Length          .894
Guttman Split-Half Coefficient                           .887
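The Spearman-Brown coefficient in the table is simply the correlation between the two half-tests corrected to full length; a quick check (a sketch, using the .808 correlation between forms reported above):

```python
# Correlation between the two test halves, from the table above:
r_halves = 0.808
# Spearman-Brown correction to full length (lengthening factor n = 2):
rho_full = 2 * r_halves / (1 + r_halves)
print(round(rho_full, 3))  # → 0.894
```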

Cronbach's α for passage scores. This analysis too yields a much lower reliability estimate than Cronbach's α computed from the item scores did!

Cronbach's α assumptions: (1) all components measure the same underlying dimension; (2) all components have the same relation to the underlying dimension; (3) all components have the same error variance. If assumptions 2 and 3 are violated, α underestimates reliability but provides a lower bound to it. If assumption 1 is violated, α misestimates reliability, and we also run into interpretational difficulties. We therefore need methods to test these assumptions.

The Birds passage

A congeneric latent variable model for the items in the Birds passage

Estimating the model. Needed: estimates of 10 parameters (4 regression coefficients, 5 error variances, and 1 variance of the latent variable). Available: 15 elements of the covariance matrix (10 covariances and 5 variances). Express the known entities in terms of the unknown parameters through application of the path rules, e.g.:
Cov(NBIRD4R, NBIRD5R) = b4 · Var(nbird) · b3
Cov(NBIRD2R, NBIRD1R) = b1 · Var(nbird) · 1
Var(NBIRD2R) = b1 · Var(nbird) · b1 + 1 · Var(e_NBIRD2R) · 1
Then solve the 15 equations for the 10 unknown parameters.

Unstandardized parameter estimates Standardized parameter estimates

Does the model fit the data? Reproduce the covariance matrix from the estimated parameters (the implied matrix) and compare it with the observed matrix, e.g.:
Cov(NBIRD4R, NBIRD5R) = 0.54 × 0.05 × 1.20 = 0.032 (observed value = 0.035)
Cov(NBIRD1R, NBIRD2R) = 1.00 × 0.05 × 0.74 = 0.037 (observed value = 0.037)
A chi-square test of model fit may also be computed: Chi-square = 110.40, df = 5, p < .001
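The implied-covariance arithmetic can be reproduced in a few lines (parameter values taken from the slide; the dictionary layout and function name are my own):

```python
# Loadings and latent variance from the unstandardized congeneric solution
# (NBIRD1R is the reference item, with its loading fixed to 1):
latvar = 0.05
b = {"NBIRD1R": 1.00, "NBIRD2R": 0.74, "NBIRD4R": 0.54, "NBIRD5R": 1.20}

def implied_cov(i: str, j: str) -> float:
    """Path rule for the implied covariance between two items: b_i * Var(latent) * b_j."""
    return b[i] * latvar * b[j]

print(round(implied_cov("NBIRD4R", "NBIRD5R"), 3))  # 0.032, observed 0.035
print(round(implied_cov("NBIRD1R", "NBIRD2R"), 3))  # 0.037, observed 0.037
```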

Problems with the Chi-square Goodness of Fit test. The test statistic is χ²-distributed only when the data have a multivariate normal distribution and maximum likelihood estimation is used. When the sample size is large, even trivial deviations between model and data cause the χ² test to be significant; when the sample size is small, even important deviations from the true model may go undetected. A model with many free parameters obtains a better χ² value than a model with few free parameters, yet models with few free parameters are generally to be preferred.

The Root Mean Square Error of Approximation (RMSEA). The RMSEA measures the amount of discrepancy between model and data in the population, taking model complexity (i.e., the number of estimated parameters) into account. Values less than 0.05 indicate good fit, and values up to 0.07 or 0.08 may be accepted. The Test of Close Fit tests the hypothesis that RMSEA < 0.05. A 90% confidence interval for the RMSEA may be constructed; the lower limit of the interval should be less than .05 and the upper limit should not exceed 0.10. The nbird model: RMSEA = 0.064, 90% CI 0.054–0.075
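The point estimate can be verified with the standard RMSEA formula, sqrt(max(χ² − df, 0) / (df · (N − 1))) (a sketch; N = 5101 is the sample size used throughout the slides):

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Congeneric model for the Birds passage (chi-square = 110.40, df = 5, N = 5101):
print(round(rmsea(110.40, 5, 5101), 3))  # → 0.064
```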

A parallel latent variable model for the items in the Birds passage

Unstandardized parameter estimates. Standardized parameter estimates. Chi-square = 2621.42, df = 13, RMSEA = 0.226, 90% CI 0.219–0.234

Reliability calculations
ρ for a single item = 0.04/(0.04 + 0.13) = 0.23
ρ for 5 items according to Spearman-Brown = 5 × 0.23/(1 + 4 × 0.23) = 0.599
ρ for the total score can also be computed with the formula
ρ = Latvar · (Σ b_i)² / (Latvar · (Σ b_i)² + Σ Resvar_i)
Parallel model: ρ = 0.04 × 25/(0.04 × 25 + 5 × 0.134) = 0.599
Congeneric model: ρ = 0.049 × 4.468²/(0.049 × 4.468² + 0.657) = 0.598
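The composite-reliability formula can be checked numerically (a sketch; the helper name is mine, and the parameter values are those given above):

```python
def composite_reliability(loadings, latvar, resvars):
    """rho = Latvar * (sum of b_i)^2 / (Latvar * (sum of b_i)^2 + sum of residual variances)."""
    true_var = latvar * sum(loadings) ** 2
    return true_var / (true_var + sum(resvars))

# Parallel model: five loadings of 1, latent variance 0.04, residual variance 0.134 per item
print(round(composite_reliability([1] * 5, 0.04, [0.134] * 5), 3))  # → 0.599
# Congeneric model: loadings summing to 4.468, latent variance 0.049, residuals summing to 0.657
# (the sums are passed as single-element lists, since only the totals enter the formula)
print(round(composite_reliability([4.468], 0.049, [0.657]), 3))  # → 0.598
```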

Some conclusions. Cronbach's α is based on several assumptions, one of which is that the items are identical in terms of their relation to the latent variable ("discrimination") and their residual variance (the parallelity assumption). However, our results indicate that estimation of α is robust against deviations from the parallelity assumption; this seems to be quite a general finding. The unidimensionality assumption is another matter, to which we now turn.

A two-dimensional model for the passages Bird and Shark

Standardized estimates for the two-dimensional model. Chi-square = 405.90, df = 35, RMSEA = 0.046, 90% CI 0.042–0.049

Standardized estimates for a one-dimensional model. Chi-square = 613.72, df = 36, RMSEA = 0.056, 90% CI 0.052–0.060

An orthogonal model with general and specific factors

Results from a model with one general factor and 15 passage factors

Fit statistics for the one-dimensional model: Chi-square = 33835.66, df = 2015, RMSEA = 0.056, 90% CI 0.055–0.056
Fit statistics for the model with one general factor and 15 passage factors: Chi-square = 12888.18, df = 2000, RMSEA = 0.032, 90% CI 0.032–0.033

Estimated components of variance in the sum of scores
READGEN   152.77
ECARD       0.03
NBIRD       0.33
DISLA       0.38
DMARIA      0.12
NDOG        0.64
EWLR        1.66
ESND        0.13
NSHK        0.38
DBTT        0.37
DBUS        0.35
DCNT        0.34
DTEMP       0.30
EMRM        0.30
NGRP        1.44
ETRE        1.84
Error       7.58

Estimated total variance             168.94
Sum of passage variances               8.61
Estimated systematic variance        161.38
Estimated total reliability            0.955
Estimated reliability for ReadGen      0.904
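The reliability figures follow directly from the variance components; a quick check (values copied from the slide; small rounding differences against the slide's totals are expected):

```python
# Variance components estimated for the model with one general factor (READGEN)
# and 15 passage factors:
general = 152.77
passage = [0.03, 0.33, 0.38, 0.12, 0.64, 1.66, 0.13, 0.38,
           0.37, 0.35, 0.34, 0.30, 0.30, 1.44, 1.84]
error = 7.58

total = general + sum(passage) + error   # 168.96 here; the slide reports 168.94 (rounding)
systematic = general + sum(passage)
print(round(systematic / total, 3))  # estimated total reliability → 0.955
print(round(general / total, 3))     # reliability for ReadGen alone → 0.904
```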

Definition of validity. Does the instrument measure what it is intended to measure? "Validity is an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." (Messick, 1989, p. 5)

The three classical forms of validity Content validity. How well do the items in a test cover a certain domain? Criterion-related validity. How well does the test predict a criterion? Construct validity. How well does the test function as an indicator of a construct?

Construct validity as the overarching validity construct. Content validity and criterion-related validity are insufficient forms of validity and themselves require construct validity. This has led to the view that construct validity is the only validity construct needed. The meaning of construct validity has also been broadened, particularly by Messick, through the introduction of consequential aspects of validity.

Threats against construct validity. Construct underrepresentation: the instrument covers only parts of the construct and leaves out important dimensions or facets. Construct-irrelevant variance: the instrument is influenced by sources of variation that have nothing to do with the construct.

Testing construct validity. "... test validation embraces all of the experimental, statistical, and philosophical means by which hypotheses and scientific theories are evaluated." (Messick, 1989, p. 6)

Sources of information about construct validity: internal structure (exploratory and confirmatory factor analysis); relations with other variables (external structure); assessment of content; studies of processes; differences over time and between groups; effects of experimental interventions; and value implications and social consequences, concerning both intended and unintended effects.

A three-dimensional model for the RL-test

Messick s progressive matrix