Basic concepts and principles of classical test theory Jan-Eric Gustafsson
What is measurement? The assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must be defined in theoretical terms. Measurement should be understood in a broad sense, and also encompasses classification and assessment.
Why should one measure? It increases precision and comparability, and it gives access to a well-developed set of tools for developing measurement instruments, collecting data, summarizing and describing data, analyzing data, and making inferences and generalizations. However: measurement suits certain purposes but not others.
Classical and modern theories of measurement. Theories of measurement to a large extent deal with how to put together components (items or subscales) into scales with known properties. Classical theories of measurement assume simple relations between the components and the dimension to be measured; measures of test properties are typically group-dependent. Modern theories of measurement (IRT) are based on probabilistic models of the relations between item scores and characteristics of persons and items. This allows ability to be estimated from different items for different persons, and item characteristics to be estimated which are invariant over groups of persons.
An example: the IEA Reading Literacy study 1991. In 1991 some 4,500 Swedish students in grade 3 participated in a study of reading literacy, along with samples of students in about 30 other countries (RL 1991). A large number of instruments were used: Reading literacy tests: 15 texts from three categories and 66 multiple-choice items. Student questionnaire: questions about home background, attitudes towards reading, and reading habits. Parental questionnaire: literacy activities in the home, economic and cultural resources, reading habits and attitudes, relations between home and school, education and occupation. Teacher questionnaire: questions about the class, the teaching of reading, resources, and the teacher. School questionnaire: characteristics of the school, resources, school climate, and relations between home and school.
Starting points for the construction of the reading literacy test. Definition: reading literacy is the ability to understand and use those forms of written language which are required in society and/or are of value for the individual. Requirements on the texts: the students should not have met them before; the texts should be possible to use again after 10 years; they should be appropriate for all countries, languages, ethnic and socioeconomic groups, and both genders; they should be possible to use stand-alone, in such a way that they could provide a meaningful reading experience; they should not be formulated in such general terms that the students would be able to answer the questions without reading the texts; and they should comprise different levels of difficulty.
The reading literacy test. Three types of texts: Narrative prose. These are continuous texts which aim to tell a story. The texts typically follow a linear time sequence, and are usually intended to entertain or to involve the reader emotionally. The texts ranged in length from short fables to longer stories. Expository prose. This category comprises continuous texts which aim to convey factual information or opinion to the reader. Documents. These are structured presentations of information, in the form of graphs, charts, maps, lists, or sets of instructions. The reader can process the information in a nonlinear fashion without reading the whole text, and typically the number of words is limited. Items: between two and six questions were asked in relation to each text. Altogether there were 66 multiple-choice items and two open-ended questions; the latter were not included in the analysis because of too low inter-rater agreement. Booklets: the 15 texts and the 66 multiple-choice items were distributed over two booklets (A and B).
Test components. This test, like most other tests, thus consists of different types of components: single questions (items), which here are scored 0 (incorrect choice) or 1 (correct choice); text passages (testlets, parcels, or item bundles), with maximum scores between 2 and 6; and booklets. If test components (items) are independent, there is more flexibility and power in designing tests than if there are dependencies among the components.
A minitest. A minitest has been created from the 10 items belonging to two of the narrative texts (Bird and Shark).
Statistical measures for the minitest. Descriptive statistics: N = 5101, Minimum = .00, Maximum = 10.00, Mean = 7.2151, Std. Deviation = 2.26547 (Valid N, listwise = 5101).
Distribution of scores for the minitest
Means and standard deviations of the items (Item Statistics; N = 5101 for each item):

Item     Mean  Std. Deviation
nbird1r  .57   .496
nbird2r  .78   .417
nbird3r  .51   .500
nbird4r  .83   .375
nbird5r  .94   .239
nshak1r  .88   .323
nshak2r  .83   .378
nshak3r  .39   .488
nshak4r  .70   .457
nshak5r  .79   .407
Correlations among the items (Inter-Item Correlation Matrix):

         nbird1r nbird2r nbird3r nbird4r nbird5r nshak1r nshak2r nshak3r nshak4r nshak5r
nbird1r   1.000   .181    .288    .303    .181    .217    .187    .250    .236    .227
nbird2r    .181  1.000    .171    .275    .206    .219    .167    .145    .164    .191
nbird3r    .288   .171   1.000    .292    .186    .205    .184    .219    .216    .214
nbird4r    .303   .275    .292   1.000    .388    .317    .249    .205    .236    .303
nbird5r    .181   .206    .186    .388   1.000    .278    .199    .109    .173    .230
nshak1r    .217   .219    .205    .317    .278   1.000    .341    .189    .245    .313
nshak2r    .187   .167    .184    .249    .199    .341   1.000    .201    .241    .221
nshak3r    .250   .145    .219    .205    .109    .189    .201   1.000    .253    .239
nshak4r    .236   .164    .216    .236    .173    .245    .241    .253   1.000    .309
nshak5r    .227   .191    .214    .303    .230    .313    .221    .239    .309   1.000
Relations between items and the total score (Item-Total Statistics):

Item     Scale Mean if   Scale Variance if  Corrected Item-Total  Squared Multiple  Cronbach's Alpha
         Item Deleted    Item Deleted       Correlation           Correlation       if Item Deleted
nbird1r  6.65            4.054              .417                  .184              .714
nbird2r  6.44            4.388              .327                  .120              .727
nbird3r  6.70            4.087              .394                  .165              .719
nbird4r  6.38            4.218              .502                  .296              .702
nbird5r  6.28            4.691              .372                  .195              .725
nshak1r  6.33            4.417              .450                  .239              .712
nshak2r  6.39            4.383              .384                  .176              .719
nshak3r  6.83            4.167              .366                  .146              .723
nshak4r  6.51            4.154              .413                  .183              .714
nshak5r  6.42            4.226              .443                  .214              .710
Reliability. The precision of an instrument. How well an instrument resists the influence of random variation. Does the instrument give the same result upon repeated measurements?
Definition of reliability. Observed score = True score + Error. An instrument's correlation with itself. The ratio between true-score variance and observed-score variance (observed-score variance = true-score variance + error variance). What is a true score and what is error?
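The variance-ratio definition can be illustrated with a small simulation, a minimal sketch assuming hypothetical variance values (true-score variance 4, error variance 1, so reliability = 4/5 = 0.80); these numbers are not from the slides:

```python
import numpy as np

# Hypothetical simulation of Observed = True + Error
rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(0.0, 2.0, n)    # true-score SD = 2, so variance = 4
error = rng.normal(0.0, 1.0, n)   # error SD = 1, so variance = 1
observed = true + error

# Reliability = true-score variance / observed-score variance = 4/(4+1) = 0.80
reliability = true.var() / observed.var()

# The same quantity is (approximately) the correlation between two
# parallel measurements: the same true score, with independent errors
observed2 = true + rng.normal(0.0, 1.0, n)
parallel_r = np.corrcoef(observed, observed2)[0, 1]

print(reliability, parallel_r)    # both close to 0.80
```

This also illustrates why reliability can be thought of as "a test's correlation with itself": two parallel forms share the true score and differ only in error.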
Sources of variance in test scores (after Thorndike, 1951) Individual characteristics External/situational factors
Reasons for reliability loss: factors at test administration; rating of responses; guessing; selection of items for the test; variation in individuals' true scores.
Ways to determine reliability. To determine reliability we would like to be able to compute the correlation between the test and itself, or to know the true scores and the error scores. This is not possible, so different approaches have been devised: Test–retest: administer the same test twice (memory effects may be a problem; sensitive to temporal instability, but not to effects of item selection). Parallel test: create an identical twin of the test (sensitive to effects of item selection; may or may not be sensitive to temporal instability). Split-half: create two parallel tests by randomly splitting the items into two groups (sensitive to effects of item selection; not sensitive to temporal instability). The split-half correlation gives the reliability for a half test, and to get it for the full test it needs to be corrected with the Spearman–Brown prophecy formula.
Ways to determine reliability, cont. Cronbach's α: a measure of internal consistency among items. Sensitive to effects of item selection; not sensitive to temporal instability. The mean of all possible split-half coefficients. Increases as a function of the correlation among the items and as a function of the number of items. α = (k/(k−1)) · [1 − Σ var(item scores)/var(total score)]
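The formula can be computed directly from a persons × items score matrix. A minimal sketch in Python; the simulated 0/1 data are hypothetical, not the IEA items:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total variance).
    `scores` is a 2-D array with rows = persons and columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()    # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical dichotomous data: 500 persons, 10 items driven by one ability
rng = np.random.default_rng(1)
ability = rng.normal(size=(500, 1))
items = (ability + rng.normal(size=(500, 10)) > 0).astype(int)
print(cronbach_alpha(items))
```

Because all 10 simulated items reflect the same ability, α comes out clearly above zero; with independent items it would be near zero.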
Computation of Cronbach's α with SPSS:

RELIABILITY
  /VARIABLES=nbird1r nbird2r nbird3r nbird4r nbird5r nshak1r nshak2r nshak3r nshak4r nshak5r
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA.
Reliability and test length. Reliability increases as a function of test length according to the Spearman–Brown prophecy formula. If we lengthen our minitest 6.5 times, to 65 items, we expect a reliability of r(6.5) = 6.5 × .737/(1 + 5.5 × .737) = .948. If we compute Cronbach's α from the 65 items we obtain:
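The prophecy calculation is a one-liner; this sketch reproduces the slide's arithmetic:

```python
def spearman_brown(r, k):
    """Predicted reliability when the test length is multiplied by k,
    given current reliability r: r_k = k*r / (1 + (k-1)*r)."""
    return k * r / (1 + (k - 1) * r)

# Lengthening the 10-item minitest (alpha = .737) by a factor of 6.5:
print(round(spearman_brown(0.737, 6.5), 3))   # 0.948, as on the slide

# The k = 2 case is the classical correction of a split-half correlation:
print(round(spearman_brown(0.808, 2), 3))     # 0.894
```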
Reliability as a function of test length
Split-half reliability (first 33 items versus last 32 items). This analysis yields a much lower reliability estimate than Cronbach's α did!

Reliability Statistics:
Cronbach's Alpha, Part 1: Value = .888, N of Items = 33
Cronbach's Alpha, Part 2: Value = .913, N of Items = 32
Total N of Items: 65
Correlation Between Forms: .808
Spearman–Brown Coefficient: Equal Length = .894, Unequal Length = .894
Guttman Split-Half Coefficient: .887
Cronbach's α for passage scores. This analysis, too, yields a much lower reliability estimate than Cronbach's α for item scores did!
Cronbach's α assumptions: 1. All components measure the same underlying dimension. 2. All components have the same relation to the underlying dimension. 3. All components have the same error variance. If assumptions 2 and 3 are violated, α will underestimate reliability, but it will provide a lower bound to reliability. If assumption 1 is violated, α misestimates reliability, and we also run into interpretational difficulties. We need methods to test these assumptions.
The Birds passage
A congeneric latent variable model for the items in the Birds passage
Estimating the model. Needed: estimates of 10 parameters (4 regression coefficients, 5 error variances, 1 variance of the latent variable). Available: 15 elements of the covariance matrix (10 covariances and 5 variances). Express the known entities in terms of the unknown parameters through application of path rules, e.g.: Cov(NBIRD4R, NBIRD5R) = b4 · Var(nbird) · b3; Cov(NBIRD2R, NBIRD1R) = b1 · Var(nbird) · 1; Var(NBIRD2R) = b1 · Var(nbird) · b1 + 1 · Var(e(NBIRD2R)) · 1. Solve the 15 equations for the 10 unknown parameters.
Unstandardized parameter estimates. Standardized parameter estimates.
Does the model fit the data? Reproduce the covariance matrix from the estimated parameters (the implied matrix) and compare it with the observed matrix, e.g.: Cov(NBIRD4R, NBIRD5R) = 0.54 × 0.05 × 1.20 = 0.032 (observed value = 0.035); Cov(NBIRD1R, NBIRD2R) = 1.00 × 0.05 × 0.74 = 0.037 (observed value = 0.037). A chi-square test of model fit may be computed: Chi-square = 110.40, df = 5, p < .001.
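The implied matrix is just the path rules applied to all 15 elements at once. A sketch: the loadings 1.00, 0.74, 1.20 and 0.54 and Var(nbird) = 0.05 are read off the slide's examples, while the loading for item 3 and all error variances are made-up placeholders:

```python
import numpy as np

# Loadings b (item 1 fixed to 1.00 for identification); 0.90 for item 3
# is a hypothetical placeholder, the rest come from the slide
b = np.array([1.00, 0.74, 0.90, 1.20, 0.54])
var_lat = 0.05                                   # Var(nbird)
err = np.array([0.20, 0.15, 0.22, 0.09, 0.13])   # hypothetical error variances

# Path rules: Cov(X_i, X_j) = b_i * Var(nbird) * b_j for i != j,
#             Var(X_i)      = b_i^2 * Var(nbird) + error variance
implied = var_lat * np.outer(b, b) + np.diag(err)

print(round(implied[0, 1], 3))   # Cov(NBIRD1R, NBIRD2R) = 0.037
print(round(implied[3, 4], 3))   # Cov(NBIRD4R, NBIRD5R) = 0.032
```

The two printed covariances match the slide's hand computations; a fit test then compares this implied matrix to the observed one.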
Problems with the chi-square goodness-of-fit test. The test statistic is χ²-distributed only when the data have a multivariate normal distribution and maximum likelihood estimation is used. When the sample size is large, even trivial deviations between model and data cause the χ² test to be significant. When the sample size is small, even important deviations from the true model may go undetected. A model with many free parameters has a better χ² value than a model with few free parameters; however, models with few free parameters are generally to be preferred over models with many.
The Root Mean Square Error of Approximation (RMSEA). The RMSEA measures the amount of discrepancy between model and data in the population, taking model complexity (i.e., the number of estimated parameters) into account. Values less than 0.05 indicate good fit, and values up to 0.07 or 0.08 may be accepted. The Test of Close Fit tests the hypothesis that RMSEA < 0.05. A 90% confidence interval for the RMSEA may be constructed; the lower limit of the interval should be less than .05 and the upper limit should not be higher than 0.10. The nbird model: RMSEA = 0.064, 90% CI 0.054–0.075.
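The point estimate can be computed from the chi-square value. A sketch using the standard sample formula RMSEA = sqrt(max(0, (χ² − df)/(df(N − 1)))), which reproduces the value reported for the nbird model with N = 5101, the sample size given earlier:

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a chi-square model test with sample size n."""
    return math.sqrt(max(0.0, (chi2 - df) / (df * (n - 1))))

# The congeneric nbird model: chi-square = 110.40, df = 5, N = 5101
print(round(rmsea(110.40, 5, 5101), 3))   # 0.064
```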
A parallel latent variable model for the items in the Birds passage
Unstandardized parameter estimates. Standardized parameter estimates. Chi-square = 2621.42, df = 13, RMSEA = 0.226, 90% CI 0.219–0.234.
Reliability calculations. ρ for a single item = 0.04/(0.04 + 0.134) = 0.23. ρ for 5 items according to Spearman–Brown = 5 × 0.23/(1 + 4 × 0.23) = 0.599. ρ for the total score can also be computed with the formula: ρ = Latvar · (Σ b_i)² / (Latvar · (Σ b_i)² + Σ Resvar_i). Parallel model: ρ = 0.04 × 25/(0.04 × 25 + 5 × 0.134) = 0.599. Congeneric model: ρ = 0.049 × 4.468²/(0.049 × 4.468² + 0.657) = 0.598.
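The total-score formula in code; the numbers are those the slide reports for the parallel and congeneric models:

```python
def composite_reliability(var_lat, loadings, res_vars):
    """rho = Latvar*(sum b_i)^2 / (Latvar*(sum b_i)^2 + sum Resvar_i)."""
    true_var = var_lat * sum(loadings) ** 2
    return true_var / (true_var + sum(res_vars))

# Parallel model: 5 equal loadings of 1.00, Var(lat) = 0.04, residuals 0.134 each
print(round(composite_reliability(0.04, [1.0] * 5, [0.134] * 5), 3))   # 0.599

# Congeneric model: loadings summing to 4.468, Var(lat) = 0.049,
# residual variances summing to 0.657
print(round(composite_reliability(0.049, [4.468], [0.657]), 3))        # 0.598
```

Note how close the two results are: this is the robustness against violations of the parallelism assumption that the conclusions slide points out.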
Some conclusions. Cronbach's α rests on several assumptions, one of which is that items should be identical in terms of their relation to the latent variable ("discrimination") and their residual variance (the parallelism assumption). However, our results indicate that estimation of α is robust against deviations from the parallelism assumption; this seems to be quite a general finding. The unidimensionality assumption is another assumption, to which we now turn.
A two-dimensional model for the passages Bird and Shark
Standardized estimates for the two-dimensional model. Chi-square = 405.90, df = 35, RMSEA = 0.046, 90% CI 0.042–0.049.
Standardized estimates for a one-dimensional model. Chi-square = 613.72, df = 36, RMSEA = 0.056, 90% CI 0.052–0.060.
An orthogonal model with general and specific factors
Results from a model with one general factor and 15 passage factors.
Fit statistics for the one-dimensional model: Chi-square = 33835.66, df = 2015, RMSEA = 0.056, 90% CI 0.055–0.056.
Fit statistics for the model with one general factor and 15 passage factors: Chi-square = 12888.18, df = 2000, RMSEA = 0.032, 90% CI 0.032–0.033.

Estimated components of variance in the sum of scores:
READGEN  152.77
ECARD      0.03
NBIRD      0.33
DISLA      0.38
DMARIA     0.12
NDOG       0.64
EWLR       1.66
ESND       0.13
NSHK       0.38
DBTT       0.37
DBUS       0.35
DCNT       0.34
DTEMP      0.30
EMRM       0.30
NGRP       1.44
ETRE       1.84
Error      7.58

Estimated total variance: 168.94
Sum of passage variances: 8.61
Estimated systematic variance: 161.38
Estimated total reliability: 0.955
Estimated reliability for READGEN: 0.904
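The reliability figures follow directly from the variance components. A sketch checking the arithmetic (the components as reported sum to a total of 168.96 rather than 168.94, a small rounding difference, which does not affect the reliabilities at three decimals):

```python
# Variance components as reported for the bifactor model
# (one general reading factor plus 15 passage-specific factors)
general = 152.77
passages = [0.03, 0.33, 0.38, 0.12, 0.64, 1.66, 0.13, 0.38,
            0.37, 0.35, 0.34, 0.30, 0.30, 1.44, 1.84]
error = 7.58

systematic = general + sum(passages)   # general + passage-specific variance
total = systematic + error

print(round(sum(passages), 2))         # 8.61, sum of passage variances
print(round(systematic, 2))            # 161.38, systematic variance
print(round(systematic / total, 3))    # 0.955, total reliability
print(round(general / total, 3))       # 0.904, reliability for READGEN
```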
Definition of validity. Does the instrument measure what it intends to measure? "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." (Messick, 1989, p. 5)
The three classical forms of validity Content validity. How well do the items in a test cover a certain domain? Criterion-related validity. How well does the test predict a criterion? Construct validity. How well does the test function as an indicator of a construct?
Construct validity as the overarching validity concept. Content validity and criterion-related validity are insufficient forms of validity on their own, and presuppose construct validity. This has led to the view that construct validity is the only validity concept needed. The meaning of construct validity has been broadened, particularly by Messick, through the introduction of consequential aspects of validity.
Threats against construct validity. Construct underrepresentation: the instrument covers only parts of the construct, and leaves out important dimensions or facets. Construct-irrelevant variance: the instrument is influenced by sources of variation which have nothing to do with the construct.
Testing construct validity. "... test validation embraces all of the experimental, statistical, and philosophical means by which hypotheses and scientific theories are evaluated." (Messick, 1989, p. 6)
Sources of information about construct validity: internal structure (exploratory and confirmatory factor analysis); relations with other variables (external structure); assessment of content; studies of processes; differences over time and between groups; effects of experimental interventions; value implications and social consequences, concerning both intended and unintended effects.
A three-dimensional model for the RL-test
Messick's progressive matrix